
Iceberg on AWS: S3FileIO, Glue Catalog, and Performance Optimization Guide

20 min read
Cazpian Engineering
Platform Engineering Team


If you are running Apache Iceberg on AWS, the single most impactful configuration decision you will make is your choice of FileIO implementation. Most teams start with HadoopFileIO and s3a:// paths because that is what their existing Hadoop-based stack already uses. It works, but it leaves significant performance on the table.

Iceberg's native S3FileIO was built from the ground up for object storage. It uses the AWS SDK v2 directly, skips the Hadoop filesystem abstraction entirely, and implements optimizations that s3a cannot — progressive multipart uploads, native bulk deletes, and minimal task serialization overhead. Teams that switch typically see faster writes, faster commits, and lower memory usage across the board.

This post covers everything you need to run Iceberg on AWS efficiently: why S3FileIO outperforms s3a, how to configure every critical property, how to avoid S3 throttling, how to set up Glue catalog correctly, and how to secure your tables with encryption and credential vending.

Why S3FileIO Outperforms HadoopFileIO (s3a)

Before diving into configuration, it is important to understand why S3FileIO exists and what problems it solves.

[Diagram: comparison of the S3FileIO and HadoopFileIO architectures — progressive multipart uploads, ObjectStoreLocationProvider hash-prefix distribution, Glue catalog locking, and encryption options]

The Problem with HadoopFileIO + s3a

When you use HadoopFileIO with s3a:// paths, Iceberg delegates all file operations to the Hadoop S3AFileSystem. This was originally designed for HDFS-like workloads and carries significant baggage when used with object storage:

1. Serialization overhead. Every Spark task must serialize the entire Hadoop Configuration object — which can be tens of kilobytes — and send it to each executor. For operations that touch many small metadata files, this serialization overhead can exceed the actual data processed.

2. Redundant filesystem contract checks. Hadoop FileSystem implementations enforce strict contracts: existence checks before writes, directory-vs-file deconfliction, rename-based commit protocols. On S3, these translate to extra HEAD and LIST requests that are unnecessary because Iceberg already uses fully addressable, unique file paths.

3. No native bulk delete. HadoopFileIO deletes files one at a time through the Hadoop FileSystem API. S3FileIO uses the S3 DeleteObjects API to delete up to 250 files per request (the default batch size) — a critical advantage when expiring snapshots or compacting tables with thousands of data files.

4. No progressive upload. With s3a, a file must be fully written locally before uploading begins. S3FileIO streams parts to S3 as soon as each part is ready, reducing disk usage and upload latency.

5. Negative caching. The Hadoop FileSystem contract requires caching negative lookups (file-not-found), which can cause stale reads in concurrent workloads. S3FileIO does not require this behavior.

Side-by-Side Comparison

| Capability | HadoopFileIO (s3a) | S3FileIO |
| --- | --- | --- |
| AWS SDK version | SDK v1 (via Hadoop) | SDK v2 (native) |
| Task serialization | Full Hadoop Configuration (~10-50 KB) | Lightweight properties (~1 KB) |
| Upload strategy | Buffer full file, then upload | Progressive multipart (stream parts) |
| Bulk delete | One file per API call | Up to 250 files per API call |
| Existence checks | Required by contract | Skipped (unique paths) |
| S3 Transfer Acceleration | Via Hadoop config | Native (s3.acceleration-enabled) |
| S3 Access Points | Not supported | Native (s3.access-points.*) |
| Credential vending | Not supported | Native via REST catalog |
| Directory bucket support | Limited | Native |

Switching from s3a to S3FileIO

If you are currently using s3a:// paths, switching is straightforward. Set the FileIO implementation in your catalog configuration:

# Spark conf: switch to S3FileIO
spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO

# Your warehouse path stays the same — just use s3:// instead of s3a://
spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/warehouse/

Important: When using S3FileIO, standardize on s3:// paths rather than s3a://. The s3:// scheme signals Iceberg's native S3 client; s3a:// is the Hadoop filesystem scheme, and in setups that rely on scheme-based FileIO resolution instead of an explicit io-impl, s3a:// paths can end up routed through HadoopFileIO.

For existing tables that were created with s3a:// paths in their metadata, you can continue reading them with S3FileIO — it handles both URI schemes. But new tables should always use s3://.

S3FileIO Configuration Reference

S3FileIO has dozens of tunable properties. Here are the ones that matter in production, organized by category.

Progressive Multipart Upload

S3FileIO's headline feature is its progressive multipart upload algorithm. Instead of buffering an entire Parquet file locally and then uploading it, S3FileIO uploads each part to S3 as soon as it reaches the configured part size. This means:

  • Lower disk usage: Parts are deleted from the local staging directory as soon as they are uploaded.
  • Lower latency: Upload and write happen in parallel.
  • Better throughput: Multiple parts upload concurrently across threads.

# Part size: 32 MB default. Increase for large files, decrease for memory-constrained environments.
s3.multipart.size=33554432

# Number of upload threads. Defaults to the number of available processors.
s3.multipart.num-threads=4

# Factor at which single PUT switches to multipart. Default: 1.5
# A file smaller than (part_size * threshold_factor) uses a single PUT.
s3.multipart.threshold-factor=1.5

# Local directory for buffering parts before upload. Defaults to system temp dir.
s3.staging-dir=/tmp/iceberg-staging

When to tune these:

  • If your executors have limited disk space, reduce s3.multipart.size to 8 MB or use a staging directory on a larger volume.
  • If you are writing many large files (500 MB+), increase s3.multipart.num-threads to 8 or more to saturate your network bandwidth.
  • If you have very fast local NVMe, the defaults are fine — the progressive algorithm already minimizes disk residency.
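The interplay between part size and threshold factor can be sketched with a few lines of arithmetic. This is a simplified model of the upload decision, not Iceberg's actual implementation; the function name is ours:

```python
# Simplified model (not Iceberg's code): a file below
# part_size * threshold_factor goes up as a single PUT;
# anything larger is split into multipart-upload parts.
import math

def upload_plan(file_size, part_size=32 * 1024 * 1024, threshold_factor=1.5):
    """Return ('put', 1) or ('multipart', num_parts) for a given file size."""
    if file_size < part_size * threshold_factor:
        return ("put", 1)
    return ("multipart", math.ceil(file_size / part_size))

mb = 1024 * 1024
print(upload_plan(40 * mb))   # 40 MB < 48 MB threshold -> ('put', 1)
print(upload_plan(500 * mb))  # 500 MB -> ('multipart', 16)
```

With the defaults, the crossover point is 48 MB: smaller files avoid multipart bookkeeping entirely, while larger files stream out in 32 MB parts.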

Bulk Delete Configuration

When Iceberg expires snapshots or runs remove_orphan_files, it may need to delete thousands of files. S3FileIO batches these into multi-object delete requests:

# Number of files per delete batch. Default: 250.
# S3 API maximum is 1000, but 250 avoids throttling.
s3.delete.batch-size=250

# Number of threads for delete operations. Defaults to available processors.
s3.delete.num-threads=4

# Disable deletes entirely (useful for append-only audit tables).
s3.delete-enabled=true

Why is the default 250 instead of the S3 maximum of 1000? Because each key in a batch counts as one write operation against S3's per-prefix throttle limit of 3,500 TPS. At batch size 1000, a single delete call can consume nearly a third of your prefix budget. The 250 default was chosen after real-world throttling issues were observed at higher values.
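The trade-off is easy to see with back-of-the-envelope arithmetic. This sketch is illustrative only (the helper is ours, not an Iceberg API):

```python
# Each key in a DeleteObjects batch counts as one write op against the
# ~3,500 write-requests/sec per-prefix limit. (Illustrative arithmetic.)
import math

PREFIX_WRITE_TPS = 3500

def delete_cost(num_files, batch_size=250):
    """Return (number of API calls, fraction of one second's prefix budget per call)."""
    batches = math.ceil(num_files / batch_size)
    budget_per_call = batch_size / PREFIX_WRITE_TPS
    return batches, budget_per_call

batches, frac = delete_cost(10_000, batch_size=1000)
print(batches, f"{frac:.0%}")  # fewer calls, but each eats ~29% of a second's budget
```

At batch size 250, the same 10,000-file expiration takes 40 calls, but each consumes only about 7% of a second's write budget, leaving headroom for concurrent writers on the same prefix.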

S3 Transfer Acceleration

If your Spark cluster is not in the same region as your S3 bucket — or if you want faster uploads across long distances — enable S3 Transfer Acceleration:

# Enable S3 Transfer Acceleration for faster cross-region uploads
s3.acceleration-enabled=true

This routes uploads through AWS CloudFront edge locations. You must first enable Transfer Acceleration on the S3 bucket itself via the AWS console or CLI. There is an additional per-GB cost, but for cross-region workloads the latency improvement is significant.

Checksum Validation

# Enable eTag checksum validation for PUT and multipart uploads. Default: true.
s3.checksum-enabled=true

This is enabled by default and validates data integrity on every write. There is no performance reason to disable it — keep it on.

Storage Class

Control the S3 storage class for all files written by Iceberg:

# Write files directly to Intelligent-Tiering to optimize storage costs
s3.write.storage-class=INTELLIGENT_TIERING

Supported values: STANDARD, REDUCED_REDUNDANCY, STANDARD_IA, ONEZONE_IA, INTELLIGENT_TIERING, GLACIER, DEEP_ARCHIVE, GLACIER_IR.

Recommendation: Use INTELLIGENT_TIERING for data files on tables with unpredictable access patterns. It automatically moves objects between frequent and infrequent access tiers with zero retrieval fees and no performance impact. For tables that are written once and rarely queried (audit logs, compliance archives), STANDARD_IA saves roughly 40% on storage costs.
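The savings claim is simple arithmetic. The prices below are assumptions based on typical us-east-1 list rates and will vary by region and over time:

```python
# Rough storage-cost comparison (illustrative; per-GB-month prices are
# assumptions and vary by region and over time).
PRICE_PER_GB_MONTH = {"STANDARD": 0.023, "STANDARD_IA": 0.0125}

def monthly_storage_cost(tb, storage_class):
    return tb * 1024 * PRICE_PER_GB_MONTH[storage_class]

std = monthly_storage_cost(100, "STANDARD")    # 100 TB in STANDARD
ia = monthly_storage_cost(100, "STANDARD_IA")  # same data in STANDARD_IA
print(f"${std:.0f} vs ${ia:.0f} ({1 - ia / std:.0%} savings)")
```

At these assumed rates, 100 TB costs about $2,355/month in STANDARD versus $1,280/month in STANDARD_IA — in the 40-45% savings range, before accounting for IA retrieval fees on any reads.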

S3 Object Tags

S3FileIO can tag every object it writes, which is invaluable for cost allocation and lifecycle management:

# Tag all written objects with custom key-value pairs
s3.write.tags.team=data-platform
s3.write.tags.environment=production

# Automatically tag objects with the Iceberg table name
s3.write.table-tag-enabled=true

# Automatically tag objects with the Iceberg namespace
s3.write.namespace-tag-enabled=true

With table-tag and namespace-tag enabled, every Parquet file, manifest, and metadata file is tagged with iceberg.table=<table_name> and iceberg.namespace=<namespace>. You can then use S3 Storage Lens and AWS Cost Explorer to break down storage costs by table — something that is otherwise impossible with flat S3 prefix-based cost allocation.

Soft-delete tags: You can also tag files during deletion instead of actually deleting them:

# Instead of deleting, tag files for deferred cleanup
s3.delete.tags.status=to-be-deleted
s3.delete.tags.deleted-by=iceberg-maintenance
s3.delete-enabled=false

This pattern is useful for compliance scenarios where you need a grace period before permanent deletion. A separate S3 Lifecycle rule can then permanently delete objects with the status=to-be-deleted tag after a retention period.
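A matching lifecycle rule could look like the following sketch, expressed as the configuration dict boto3 expects. The rule ID and the 30-day retention period are assumptions:

```python
import json

# S3 Lifecycle rule that permanently deletes objects carrying the
# soft-delete tag after a grace period (30 days here is an assumption).
lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-iceberg-soft-deletes",
            "Status": "Enabled",
            "Filter": {"Tag": {"Key": "status", "Value": "to-be-deleted"}},
            "Expiration": {"Days": 30},
        }
    ]
}

# With boto3, this dict would be applied via:
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="my-bucket", LifecycleConfiguration=lifecycle_config)
print(json.dumps(lifecycle_config, indent=2))
```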

Retry Configuration

# Number of retries for S3 operations. Default: 5.
s3.retry.num-retries=5

# Minimum retry wait time in milliseconds. Default: 100.
s3.retry.min-wait-ms=100

# Maximum retry wait time in milliseconds. Default: 13000.
s3.retry.max-wait-ms=13000

The defaults use exponential backoff and are well-suited for most workloads. Increase s3.retry.num-retries to 10 if you are running in a region with occasional S3 throttling.

Client Initialization

# Pre-initialize the S3 client at catalog load time instead of first use
s3.preload-client-enabled=true

By default, S3FileIO initializes the S3 client lazily on first use. Setting s3.preload-client-enabled=true creates the client immediately when the catalog is loaded. This avoids a cold-start latency hit on the first query — useful for short-lived Spark applications or Lambda-based workloads where every second counts.

Avoiding S3 Throttling with ObjectStoreLocationProvider

S3 throttles requests at 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix. With a traditional Hive-style layout, all data files for a given partition share the same prefix:

s3://bucket/warehouse/db/events/event_date=2026-02-12/
    00001-file-abc.parquet
    00002-file-def.parquet
    00003-file-ghi.parquet
    ...

If you are writing thousands of files per partition (common in streaming or high-volume batch), you will hit the prefix throttle.

The Solution: ObjectStoreLocationProvider

Iceberg's ObjectStoreLocationProvider generates a deterministic hash for each file and prepends it to the path, distributing files across many S3 prefixes automatically:

s3://bucket/warehouse/
    a1b2c3d4/db/events/event_date=2026-02-12/00001-file-abc.parquet
    e5f6g7h8/db/events/event_date=2026-02-12/00002-file-def.parquet
    i9j0k1l2/db/events/event_date=2026-02-12/00003-file-ghi.parquet

Each file now has a unique prefix, spreading load across S3's distributed infrastructure.

Enable it per table:

CREATE TABLE analytics.events (
    event_id BIGINT,
    event_time TIMESTAMP,
    user_id STRING,
    payload STRING
)
USING iceberg
PARTITIONED BY (day(event_time))
TBLPROPERTIES (
    'write.object-storage.enabled' = 'true',
    'write.data.path' = 's3://my-bucket/warehouse/db/events/data'
);

Or alter an existing table:

ALTER TABLE analytics.events SET TBLPROPERTIES (
    'write.object-storage.enabled' = 'true',
    'write.data.path' = 's3://my-bucket/warehouse/db/events/data'
);

Key points:

  • write.object-storage.enabled activates the hash-prefix distribution.
  • write.data.path specifies the root location under which hashed paths are generated. If not set, the table location is used.
  • This does not affect query performance — Iceberg reads from manifest files which contain the full path to every data file, so the hashed prefixes are transparent to readers.
  • This does not affect partition pruning — pruning happens at the metadata layer, not the filesystem layer.

When to use it: Always enable it for tables that write more than 1,000 files per hour. For low-volume tables (a few files per day), it is unnecessary but harmless.
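The idea behind the location provider can be illustrated with a short sketch. This mimics the hash-prefix scheme conceptually; it is not Iceberg's exact hash function or directory layout:

```python
# Illustrative sketch of hash-prefix distribution (not Iceberg's exact
# algorithm): derive a short deterministic hash from the file's relative
# path and prepend it, so files fan out across many S3 prefixes.
import hashlib

def object_store_path(data_root, relative_path):
    prefix = hashlib.sha1(relative_path.encode()).hexdigest()[:8]
    return f"{data_root}/{prefix}/{relative_path}"

p1 = object_store_path("s3://bucket/warehouse",
                       "db/events/event_date=2026-02-12/00001-file-abc.parquet")
p2 = object_store_path("s3://bucket/warehouse",
                       "db/events/event_date=2026-02-12/00002-file-def.parquet")
print(p1)
print(p2)  # same partition, different hash prefix -> different throttle bucket
```

Because the hash is deterministic per file name, the layout stays stable, while two files in the same partition land under different prefixes and therefore different S3 throttle buckets.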

Glue Catalog Configuration

AWS Glue Data Catalog is the most common Iceberg catalog on AWS. Here is how to configure it correctly.

Basic Setup

# Use Glue as the Iceberg catalog
spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue_catalog.warehouse=s3://my-bucket/warehouse/
spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO

Note: GlueCatalog uses S3FileIO by default, so io-impl is optional here — but it is good practice to be explicit.
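If you assemble Spark configuration in code, the same settings can be generated programmatically. A minimal sketch, assuming placeholder catalog and bucket names (the helper function is ours):

```python
# Sketch: build the Glue-catalog Spark conf entries as a dict
# (catalog name and warehouse bucket are placeholders).
def glue_catalog_conf(name, warehouse):
    base = f"spark.sql.catalog.{name}"
    return {
        base: "org.apache.iceberg.spark.SparkCatalog",
        f"{base}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{base}.warehouse": warehouse,
        f"{base}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{base}.glue.skip-archive": "true",
    }

conf = glue_catalog_conf("glue_catalog", "s3://my-bucket/warehouse/")

# With pyspark installed, apply via:
#   builder = SparkSession.builder
#   for k, v in conf.items():
#       builder = builder.config(k, v)
```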

Optimistic Locking

Glue 4.0+ uses optimistic locking by default to guarantee atomic commits. When a Spark job commits a new snapshot, Glue checks that the table's version ID has not changed since the job read the metadata. If another job committed in between, the commit fails and Iceberg retries.

This replaces the older DynamoDB lock manager approach. If you are on Glue 3.0 or using Iceberg < 1.0, you need the DynamoDB lock:

# Only needed for Glue 3.0 / Iceberg < 1.0
spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager
spark.sql.catalog.glue_catalog.lock.table=iceberg_locks

For Glue 4.0+, no lock configuration is needed — optimistic locking is automatic. Just ensure your AWS SDK version is >= 2.17.131.

Skip Archive (Critical for Streaming)

# Skip archiving old Glue table versions. Default: true (skipped).
spark.sql.catalog.glue_catalog.glue.skip-archive=true

When glue.skip-archive is false, every Iceberg commit creates a new Glue table version, and older versions are archived. For streaming ingestion that commits every minute, this generates thousands of Glue table versions per day, increasing Glue API costs and slowing down GetTable calls. Keep this at true (the default) unless you have a specific compliance reason to archive Glue versions.
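The version growth is simple arithmetic; a quick sketch:

```python
# How quickly archived Glue table versions accumulate when
# glue.skip-archive is false: one archived version per commit.
def versions_per_day(commit_interval_seconds):
    return 86_400 // commit_interval_seconds

print(versions_per_day(60))   # one-minute commits  -> 1440 versions/day
print(versions_per_day(300))  # five-minute commits -> 288 versions/day
```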

Skip Name Validation

# Skip Glue naming validation (allow non-Hive-compatible names)
spark.sql.catalog.glue_catalog.glue.skip-name-validation=true

Glue enforces Hive-compatible naming rules by default (lowercase, no special characters). Set this to true if your existing naming conventions use characters that Glue would reject. Otherwise, leave it at the default.

Glue Catalog ID (Cross-Account)

# Use a specific AWS account's Glue catalog (for cross-account access)
spark.sql.catalog.glue_catalog.glue.id=123456789012

This is essential for cross-account data mesh architectures where a central account hosts the Glue catalog and multiple producer accounts write to it.

Server-Side Encryption

S3FileIO supports all three S3 server-side encryption modes. Configure encryption at the catalog level so every table inherits it automatically:

SSE-S3 (Amazon-Managed Keys)

# Simplest option: Amazon manages the keys
s3.sse.type=s3

Zero configuration beyond this. Amazon handles key rotation automatically. This is the right choice for most teams.

SSE-KMS (AWS KMS-Managed Keys)

# Use a specific KMS key for all Iceberg writes
s3.sse.type=kms
s3.sse.key=arn:aws:kms:us-east-1:123456789012:key/your-key-id

Use SSE-KMS when you need:

  • Customer-managed key rotation policies.
  • CloudTrail audit logging of every key usage.
  • Cross-account access control via KMS key policies.

Cost note: Each S3 PUT triggers a KMS API call. For high-volume tables writing thousands of files per hour, this can add significant KMS costs (~$0.03 per 10,000 requests). Consider using S3 Bucket Keys (enabled at the bucket level) to reduce KMS calls by up to 99%.
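At that rate, the cost scales linearly with file count. Illustrative arithmetic only — the $0.03-per-10,000-requests price is taken from the note above and varies by region:

```python
# KMS request cost for SSE-KMS writes (illustrative; the per-request
# rate is an assumption from the text and varies by region).
KMS_COST_PER_REQUEST = 0.03 / 10_000

def monthly_kms_cost(files_per_hour, bucket_keys=False):
    requests = files_per_hour * 24 * 30
    if bucket_keys:
        requests *= 0.01  # Bucket Keys can cut KMS calls by up to 99%
    return requests * KMS_COST_PER_REQUEST

print(f"${monthly_kms_cost(5000):.2f}")                    # no Bucket Keys
print(f"${monthly_kms_cost(5000, bucket_keys=True):.2f}")  # with Bucket Keys
```

At 5,000 files/hour this works out to roughly $10.80/month without Bucket Keys versus about $0.11 with them — small in absolute terms here, but it grows with every table sharing the key.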

DSSE-KMS (Dual-Layer Encryption)

# Dual-layer server-side encryption (FIPS compliance)
s3.sse.type=dsse-kms
s3.sse.key=arn:aws:kms:us-east-1:123456789012:key/your-key-id

DSSE-KMS applies two independent layers of encryption. It is required by certain government and financial compliance regimes that mandate multiple layers of encryption at rest.

SSE-C (Customer-Provided Keys)

# You provide the encryption key directly
s3.sse.type=custom
s3.sse.key=<base64-encoded-AES-256-key>
s3.sse.md5=<base64-encoded-MD5-digest-of-key>

SSE-C means you manage the key yourself. If you lose the key, the data is unrecoverable. Use this only if regulatory requirements mandate that AWS never stores your encryption key.

Credential Vending and Remote Signing

For multi-tenant architectures where different users or teams should only access specific tables, Iceberg supports two approaches through the REST catalog protocol.

Credential Vending

With credential vending, the catalog issues short-lived, scoped AWS credentials for each table access:

  1. A query engine requests access to analytics.events.
  2. The REST catalog generates temporary STS credentials scoped only to that table's S3 prefix.
  3. The engine uses those credentials to read/write data files.
  4. Credentials expire after minutes, enforcing the principle of least privilege.

This is supported by catalogs like Polaris, Snowflake, Dremio, and Nessie when running in REST mode. No special S3FileIO configuration is needed — the catalog pushes the credential configuration to the client automatically.
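Conceptually, the vended credentials carry a policy scoped to the table's prefix, along the lines of this sketch (illustrative IAM JSON built in Python; not what any particular catalog actually emits):

```python
# Sketch of the kind of scoped-down policy a vending catalog might attach
# to temporary STS credentials (illustrative; bucket and prefix are
# placeholders, and real catalogs may scope differently).
def table_scoped_policy(bucket, table_prefix):
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{table_prefix}/*",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{table_prefix}/*"]}},
            },
        ],
    }

policy = table_scoped_policy("my-bucket", "warehouse/analytics/events")
```

A client holding credentials bound to this policy can touch objects under the table's prefix and nothing else in the bucket.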

Remote Signing

An alternative to credential vending: instead of giving the client temporary credentials, the catalog signs each S3 request on the server side:

# Enable remote signing — the catalog signs all S3 requests
s3.remote-signing-enabled=true

With remote signing, the query engine never sees any AWS credentials. Every S3 GET, PUT, and DELETE request is sent to the catalog for signing before being forwarded to S3. This provides the strongest security posture because no storage credentials ever leave the catalog server.

Trade-off: Remote signing adds a network round-trip per S3 request, which increases latency. Use it for environments where security requirements outweigh performance (e.g., regulated industries, multi-tenant SaaS platforms).

S3 Access Points

For complex networking setups — VPC endpoints, cross-account access, or firewall rules per table — use S3 Access Points:

# Map a specific bucket to an S3 Access Point ARN
s3.access-points.my-bucket=arn:aws:s3:us-east-1:123456789012:accesspoint/my-access-point

S3FileIO will route all requests for s3://my-bucket/... through the specified Access Point. This lets you apply different network policies and IAM policies per table or per namespace without changing the table's S3 path.

Cross-Region Access

# Allow reading from Access Points in a different region than the S3 client
s3.use-arn-region-enabled=true

# Allow cross-region bucket access (S3 Multi-Region Access Points)
s3.cross-region-access-enabled=true

Enable these for disaster recovery or multi-region query architectures where your Spark cluster may be in a different region than your data.

Custom S3 Client Factory

For advanced use cases — custom retry policies, proxy configuration, request interceptors, or integration with corporate credential providers — you can provide a custom S3 client factory:

# Use a custom factory class for building the S3 client
s3.client.factory=com.mycompany.iceberg.CustomS3ClientFactory

Your factory must implement S3FileIOAwsClientFactory:

package com.mycompany.iceberg;

import java.util.Map;

import org.apache.iceberg.aws.s3.S3FileIOAwsClientFactory;
import software.amazon.awssdk.core.retry.RetryPolicy;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.services.s3.S3Client;

public class CustomS3ClientFactory implements S3FileIOAwsClientFactory {

    private Map<String, String> properties;

    @Override
    public S3Client s3() {
        return S3Client.builder()
            .overrideConfiguration(config -> config
                // Example: raise the retry ceiling beyond the SDK default
                .retryPolicy(RetryPolicy.builder().numRetries(10).build()))
            // Example: swap in a custom HTTP client (proxy, timeouts, pool size)
            .httpClientBuilder(ApacheHttpClient.builder())
            .build();
    }

    @Override
    public void initialize(Map<String, String> properties) {
        // Catalog properties are passed in here; read any custom keys you need.
        this.properties = properties;
    }
}

Common use cases for a custom factory:

  • Corporate HTTP proxies that require authentication.
  • Custom metrics collection on S3 request latency and error rates.
  • Integrating with a secrets manager for credential retrieval.
  • Adding request tracing headers for observability.

Package your factory class into a JAR and add it to Spark's classpath:

spark-submit \
  --jars /path/to/custom-s3-factory.jar \
  --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.my_catalog.s3.client.factory=com.mycompany.iceberg.CustomS3ClientFactory \
  your-application.jar

Complete Spark Configuration Template

Here is a production-ready Spark configuration that puts everything together:

# ── Catalog: Glue ──
spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.iceberg.warehouse=s3://production-bucket/warehouse/
spark.sql.catalog.iceberg.io-impl=org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.iceberg.glue.skip-archive=true

# ── S3FileIO: Uploads ──
spark.sql.catalog.iceberg.s3.multipart.size=33554432
spark.sql.catalog.iceberg.s3.multipart.num-threads=4
spark.sql.catalog.iceberg.s3.checksum-enabled=true
spark.sql.catalog.iceberg.s3.preload-client-enabled=true

# ── S3FileIO: Deletes ──
spark.sql.catalog.iceberg.s3.delete.batch-size=250
spark.sql.catalog.iceberg.s3.delete.num-threads=4

# ── S3FileIO: Encryption ──
spark.sql.catalog.iceberg.s3.sse.type=s3

# ── S3FileIO: Tags (for cost allocation) ──
spark.sql.catalog.iceberg.s3.write.table-tag-enabled=true
spark.sql.catalog.iceberg.s3.write.namespace-tag-enabled=true
spark.sql.catalog.iceberg.s3.write.tags.environment=production

# ── S3FileIO: Storage class ──
spark.sql.catalog.iceberg.s3.write.storage-class=INTELLIGENT_TIERING

# ── S3FileIO: Retries ──
spark.sql.catalog.iceberg.s3.retry.num-retries=5
spark.sql.catalog.iceberg.s3.retry.min-wait-ms=100
spark.sql.catalog.iceberg.s3.retry.max-wait-ms=13000

Table-Level Overrides

Remember that ObjectStoreLocationProvider is set per table, not per catalog:

CREATE TABLE iceberg.analytics.high_volume_events (
    event_id BIGINT,
    event_time TIMESTAMP,
    user_id STRING,
    event_type STRING,
    payload STRING
)
USING iceberg
PARTITIONED BY (day(event_time))
TBLPROPERTIES (
    'write.object-storage.enabled' = 'true',
    'write.data.path' = 's3://production-bucket/warehouse/analytics/high_volume_events/data',
    'write.distribution-mode' = 'hash',
    'write.parquet.compression-codec' = 'zstd'
);

S3FileIO Properties Quick Reference

Here is every S3FileIO property in one table for quick reference:

| Property | Default | Description |
| --- | --- | --- |
| s3.endpoint | — | Custom S3 endpoint URL |
| s3.path-style-access | false | Use path-style access (for MinIO, LocalStack) |
| s3.access-key-id | — | Static access key (prefer IAM roles instead) |
| s3.secret-access-key | — | Static secret key (prefer IAM roles instead) |
| s3.session-token | — | Static session token |
| s3.multipart.size | 32 MB | Part size for multipart uploads |
| s3.multipart.num-threads | CPU count | Threads for uploading parts |
| s3.multipart.threshold-factor | 1.5 | Factor to switch from single PUT to multipart |
| s3.staging-dir | System temp | Local staging directory for parts |
| s3.checksum-enabled | true | eTag validation on writes |
| s3.delete.batch-size | 250 | Files per bulk delete request |
| s3.delete.num-threads | CPU count | Threads for delete operations |
| s3.delete-enabled | true | Whether deletes are allowed |
| s3.delete.tags.* | — | Tags applied before soft-delete |
| s3.write.tags.* | — | Tags applied during writes |
| s3.write.table-tag-enabled | false | Auto-tag with table name |
| s3.write.namespace-tag-enabled | false | Auto-tag with namespace |
| s3.write.storage-class | — | S3 storage class for writes |
| s3.sse.type | none | Encryption: none, s3, kms, dsse-kms, custom |
| s3.sse.key | — | KMS key ARN or base64 AES key |
| s3.sse.md5 | — | MD5 digest for SSE-C |
| s3.acceleration-enabled | false | S3 Transfer Acceleration |
| s3.dualstack-enabled | false | IPv4/IPv6 dual-stack endpoints |
| s3.cross-region-access-enabled | false | Cross-region bucket access |
| s3.use-arn-region-enabled | false | Cross-region Access Point calls |
| s3.access-points.* | — | Bucket-to-Access-Point mapping |
| s3.remote-signing-enabled | false | Remote request signing |
| s3.preload-client-enabled | false | Initialize client eagerly |
| s3.acl | — | Canned ACL for writes |
| s3.client.factory | — | Custom S3 client factory class |
| s3.retry.num-retries | 5 | Max retries for S3 operations |
| s3.retry.min-wait-ms | 100 | Min backoff between retries |
| s3.retry.max-wait-ms | 13000 | Max backoff between retries |
| s3.access-grants.enabled | false | S3 Access Grants integration |
| s3.access-grants.fallback-to-iam | true | Fall back to IAM if Access Grants denies |

Production Checklist

Use this checklist when setting up Iceberg on AWS:

FileIO:

  • Set io-impl to org.apache.iceberg.aws.s3.S3FileIO
  • Use s3:// paths, not s3a://
  • Enable s3.preload-client-enabled for short-lived applications
  • Set s3.checksum-enabled=true (default)

Throttling prevention:

  • Enable write.object-storage.enabled on high-volume tables
  • Keep s3.delete.batch-size at 250 (default)
  • Monitor S3 503 SlowDown errors in CloudWatch

Encryption:

  • Set s3.sse.type to at least s3 for encryption at rest
  • Use kms with S3 Bucket Keys for KMS cost optimization
  • Audit encryption settings with S3 Storage Lens

Cost management:

  • Enable s3.write.table-tag-enabled and s3.write.namespace-tag-enabled
  • Set s3.write.storage-class=INTELLIGENT_TIERING for variable-access tables
  • Use S3 Lifecycle rules to transition old snapshots to cheaper tiers

Catalog (Glue):

  • Set glue.skip-archive=true for streaming tables
  • Ensure AWS SDK >= 2.17.131 for optimistic locking
  • Remove DynamoDB lock manager config if using Glue 4.0+

Security:

  • Use IAM roles, not static credentials (s3.access-key-id)
  • Evaluate credential vending for multi-tenant deployments
  • Evaluate s3.remote-signing-enabled for regulated environments
  • Review S3 Access Points for network segmentation

How Cazpian Handles This

On Cazpian, S3FileIO is the default — every Iceberg table uses it out of the box. Cazpian's managed Spark clusters come pre-configured with optimized S3FileIO settings: progressive multipart uploads, Glue catalog with optimistic locking, and SSE-S3 encryption enabled by default. ObjectStoreLocationProvider is automatically enabled for tables that exceed configurable write thresholds. You focus on your data — we handle the infrastructure.

What's Next

This post covered the storage and catalog layer. For related topics, see our other posts in this series: