Iceberg on AWS: S3FileIO, Glue Catalog, and Performance Optimization Guide
If you are running Apache Iceberg on AWS, the single most impactful configuration decision you will make is your choice of FileIO implementation. Most teams start with HadoopFileIO and s3a:// paths because that is what their existing Hadoop-based stack already uses. It works, but it leaves significant performance on the table.
Iceberg's native S3FileIO was built from the ground up for object storage. It uses the AWS SDK v2 directly, skips the Hadoop filesystem abstraction entirely, and implements optimizations that s3a cannot — progressive multipart uploads, native bulk deletes, and far lighter task serialization. Teams that switch typically see faster writes, faster commits, and lower memory usage across the board.
This post covers everything you need to run Iceberg on AWS efficiently: why S3FileIO outperforms s3a, how to configure every critical property, how to avoid S3 throttling, how to set up Glue catalog correctly, and how to secure your tables with encryption and credential vending.
Why S3FileIO Outperforms HadoopFileIO (s3a)
Before diving into configuration, it is important to understand why S3FileIO exists and what problems it solves.
The Problem with HadoopFileIO + s3a
When you use HadoopFileIO with s3a:// paths, Iceberg delegates all file operations to the Hadoop S3AFileSystem. This was originally designed for HDFS-like workloads and carries significant baggage when used with object storage:
1. Serialization overhead. Every Spark task must serialize the entire Hadoop Configuration object — which can be tens of kilobytes — and send it to each executor. For operations that touch many small metadata files, this serialization overhead can exceed the actual data processed.
2. Redundant filesystem contract checks. Hadoop FileSystem implementations enforce strict contracts: existence checks before writes, directory-vs-file deconfliction, rename-based commit protocols. On S3, these translate to extra HEAD and LIST requests that are unnecessary because Iceberg already uses fully addressable, unique file paths.
3. No native bulk delete. HadoopFileIO deletes files one at a time through the Hadoop FileSystem API. S3FileIO uses the S3 DeleteObjects API to remove files in batches — 250 per request by default (the API itself allows up to 1,000) — a critical advantage when expiring snapshots or compacting tables with thousands of data files.
4. No progressive upload. With s3a, a file must be fully written locally before uploading begins. S3FileIO streams parts to S3 as soon as each part is ready, reducing disk usage and upload latency.
5. Negative caching. The Hadoop FileSystem contract requires caching negative lookups (file-not-found), which can cause stale reads in concurrent workloads. S3FileIO does not require this behavior.
Side-by-Side Comparison
| Capability | HadoopFileIO (s3a) | S3FileIO |
|---|---|---|
| AWS SDK version | SDK v1 (via Hadoop) | SDK v2 (native) |
| Task serialization | Full Hadoop Configuration (~10-50 KB) | Lightweight properties (~1 KB) |
| Upload strategy | Buffer full file, then upload | Progressive multipart (stream parts) |
| Bulk delete | One file per API call | Up to 250 files per API call |
| Existence checks | Required by contract | Skipped (unique paths) |
| S3 Transfer Acceleration | Via Hadoop config | Native s3.acceleration-enabled |
| S3 Access Points | Not supported | Native s3.access-points.* |
| Credential vending | Not supported | Native via REST catalog |
| Directory bucket support | Limited | Native |
Switching from s3a to S3FileIO
If you are currently using s3a:// paths, switching is straightforward. Set the FileIO implementation in your catalog configuration:
# Spark conf: switch to S3FileIO
spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
# Your warehouse path stays the same — just use s3:// instead of s3a://
spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/warehouse/
Important: When using S3FileIO, use s3:// paths. S3FileIO can parse s3a:// and s3n:// URIs for backward compatibility, but s3:// makes your intent unambiguous: some setups resolve the FileIO implementation from the path scheme, and s3a:// historically routed through the Hadoop stack.
For existing tables that were created with s3a:// paths in their metadata, you can continue reading them with S3FileIO — it handles both URI schemes. But new tables should always use s3://.
S3FileIO Configuration Reference
S3FileIO has dozens of tunable properties. Here are the ones that matter in production, organized by category.
Progressive Multipart Upload
S3FileIO's headline feature is its progressive multipart upload algorithm. Instead of buffering an entire Parquet file locally and then uploading it, S3FileIO uploads each part to S3 as soon as it reaches the configured part size. This means:
- Lower disk usage: Parts are deleted from the local staging directory as soon as they are uploaded.
- Lower latency: Upload and write happen in parallel.
- Better throughput: Multiple parts upload concurrently across threads.
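The core idea can be sketched in a few lines. This is a simplified illustration of the progressive part-handoff pattern, not Iceberg's actual implementation — the class and buffer sizes here are invented for demonstration:

```python
# Simplified illustration of progressive multipart upload: each full part is
# handed off for upload as soon as the buffer fills, instead of waiting for
# the whole file to be written before the first byte goes to S3.

class ProgressiveWriter:
    def __init__(self, part_size=4):  # tiny part size for illustration; S3FileIO defaults to 32 MB
        self.part_size = part_size
        self.buffer = bytearray()
        self.uploaded_parts = []  # stands in for completed S3 UploadPart calls

    def write(self, data: bytes):
        self.buffer.extend(data)
        # Upload every full part immediately; only the tail stays buffered locally.
        while len(self.buffer) >= self.part_size:
            part, self.buffer = self.buffer[:self.part_size], self.buffer[self.part_size:]
            self.uploaded_parts.append(bytes(part))

    def close(self):
        # Flush the final (possibly short) part, then CompleteMultipartUpload.
        if self.buffer:
            self.uploaded_parts.append(bytes(self.buffer))
            self.buffer = bytearray()
        return self.uploaded_parts

w = ProgressiveWriter()
w.write(b"abcdefgh")  # two full parts "upload" immediately
w.write(b"ij")        # tail stays buffered until close()
parts = w.close()     # [b"abcd", b"efgh", b"ij"]
```

The key property: local disk only ever holds the in-flight tail, never the whole file.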
# Part size: 32 MB default. Increase for large files, decrease for memory-constrained environments.
s3.multipart.size=33554432
# Number of upload threads. Defaults to the number of available processors.
s3.multipart.num-threads=4
# Factor at which single PUT switches to multipart. Default: 1.5
# A file smaller than (part_size * threshold_factor) uses a single PUT.
s3.multipart.threshold-factor=1.5
# Local directory for buffering parts before upload. Defaults to system temp dir.
s3.staging-dir=/tmp/iceberg-staging
When to tune these:
- If your executors have limited disk space, reduce s3.multipart.size to 8 MB or use a staging directory on a larger volume.
- If you are writing many large files (500 MB+), increase s3.multipart.num-threads to 8 or more to saturate your network bandwidth.
- If you have very fast local NVMe, the defaults are fine — the progressive algorithm already minimizes disk residency.
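To make the threshold-factor behavior concrete, here is a sketch of the decision it implies. The function name is illustrative, not Iceberg's API:

```python
# Sketch of the single-PUT vs. multipart decision implied by
# s3.multipart.threshold-factor: files under part_size * factor go up in one
# PUT; larger files are split into ceil(size / part_size) parts.
import math

def upload_plan(file_size, part_size=32 * 1024 * 1024, threshold_factor=1.5):
    threshold = int(part_size * threshold_factor)  # 48 MB with the defaults
    if file_size < threshold:
        return ("single-put", 1)
    return ("multipart", math.ceil(file_size / part_size))

print(upload_plan(40 * 1024 * 1024))   # 40 MB < 48 MB threshold -> single PUT
print(upload_plan(200 * 1024 * 1024))  # 200 MB -> 7 parts of 32 MB
```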
Bulk Delete Configuration
When Iceberg expires snapshots or runs remove_orphan_files, it may need to delete thousands of files. S3FileIO batches these into multi-object delete requests:
# Number of files per delete batch. Default: 250.
# S3 API maximum is 1000, but 250 avoids throttling.
s3.delete.batch-size=250
# Number of threads for delete operations. Defaults to available processors.
s3.delete.num-threads=4
# Disable deletes entirely (useful for append-only audit tables).
s3.delete-enabled=true
Why is the default 250 instead of the S3 maximum of 1000? Because each key in a batch counts as one write operation against S3's per-prefix throttle limit of 3,500 TPS. At batch size 1000, a single delete call can consume nearly a third of your prefix budget. The 250 default was chosen after real-world throttling issues were observed at higher values.
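The batching arithmetic is easy to verify yourself. A minimal sketch (not Iceberg's code — just the chunking logic):

```python
# Chunk N doomed files into DeleteObjects batches. Each key in a batch still
# counts as one write operation against the 3,500 req/s per-prefix budget,
# so smaller batches spread the write load across more (cheaper) API calls.

def delete_batches(keys, batch_size=250):
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]

keys = [f"data/file-{i}.parquet" for i in range(10_000)]
batches = delete_batches(keys)
# 10,000 orphan files -> 40 DeleteObjects calls of 250 keys each
```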
S3 Transfer Acceleration
If your Spark cluster is not in the same region as your S3 bucket — or if you want faster uploads across long distances — enable S3 Transfer Acceleration:
# Enable S3 Transfer Acceleration for faster cross-region uploads
s3.acceleration-enabled=true
This routes uploads through AWS CloudFront edge locations. You must first enable Transfer Acceleration on the S3 bucket itself via the AWS console or CLI. There is an additional per-GB cost, but for cross-region workloads the latency improvement is significant.
Checksum Validation
# Enable eTag checksum validation for PUT and multipart uploads. Default: true.
s3.checksum-enabled=true
This is enabled by default and validates data integrity on every write. There is no performance reason to disable it — keep it on.
Storage Class
Control the S3 storage class for all files written by Iceberg:
# Write files directly to Intelligent-Tiering to optimize storage costs
s3.write.storage-class=INTELLIGENT_TIERING
Supported values: STANDARD, REDUCED_REDUNDANCY, STANDARD_IA, ONEZONE_IA, INTELLIGENT_TIERING, GLACIER, DEEP_ARCHIVE, GLACIER_IR.
Recommendation: Use INTELLIGENT_TIERING for data files on tables with unpredictable access patterns. It automatically moves objects between frequent and infrequent access tiers with zero retrieval fees and no performance impact. For tables that are written once and rarely queried (audit logs, compliance archives), STANDARD_IA saves roughly 40% on storage costs.
S3 Object Tags
S3FileIO can tag every object it writes, which is invaluable for cost allocation and lifecycle management:
# Tag all written objects with custom key-value pairs
s3.write.tags.team=data-platform
s3.write.tags.environment=production
# Automatically tag objects with the Iceberg table name
s3.write.table-tag-enabled=true
# Automatically tag objects with the Iceberg namespace
s3.write.namespace-tag-enabled=true
With table-tag and namespace-tag enabled, every Parquet file, manifest, and metadata file is tagged with iceberg.table=<table_name> and iceberg.namespace=<namespace>. You can then use S3 Storage Lens and AWS Cost Explorer to break down storage costs by table — something that is otherwise impossible with flat S3 prefix-based cost allocation.
Soft-delete tags: You can also tag files during deletion instead of actually deleting them:
# Instead of deleting, tag files for deferred cleanup
s3.delete.tags.status=to-be-deleted
s3.delete.tags.deleted-by=iceberg-maintenance
s3.delete-enabled=false
This pattern is useful for compliance scenarios where you need a grace period before permanent deletion. A separate S3 Lifecycle rule can then permanently delete objects with the status=to-be-deleted tag after a retention period.
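A lifecycle rule for this pattern might look like the following sketch, applied with aws s3api put-bucket-lifecycle-configuration. The rule ID and the 30-day retention period are illustrative — pick values that match your compliance policy:

```json
{
  "Rules": [
    {
      "ID": "purge-soft-deleted-iceberg-files",
      "Status": "Enabled",
      "Filter": { "Tag": { "Key": "status", "Value": "to-be-deleted" } },
      "Expiration": { "Days": 30 }
    }
  ]
}
```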
Retry Configuration
# Number of retries for S3 operations. Default: 5.
s3.retry.num-retries=5
# Minimum retry wait time in milliseconds. Default: 100.
s3.retry.min-wait-ms=100
# Maximum retry wait time in milliseconds. Default: 13000.
s3.retry.max-wait-ms=13000
The defaults use exponential backoff and are well-suited for most workloads. Increase s3.retry.num-retries to 10 if you are running in a region with occasional S3 throttling.
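The interplay of these three properties can be sketched as capped exponential backoff. The exact backoff and jitter strategy belongs to the AWS SDK; this only shows how the min/max bounds shape the wait sequence:

```python
# Capped exponential backoff: wait doubles from min-wait-ms each attempt,
# but never exceeds max-wait-ms (jitter omitted for clarity).

def backoff_waits(num_retries=5, min_wait_ms=100, max_wait_ms=13_000):
    return [min(min_wait_ms * 2 ** attempt, max_wait_ms)
            for attempt in range(num_retries)]

print(backoff_waits())    # [100, 200, 400, 800, 1600]
print(backoff_waits(10))  # later attempts are capped at 13000 ms
```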
Client Initialization
# Pre-initialize the S3 client at catalog load time instead of first use
s3.preload-client-enabled=true
By default, S3FileIO initializes the S3 client lazily on first use. Setting s3.preload-client-enabled=true creates the client immediately when the catalog is loaded. This avoids a cold-start latency hit on the first query — useful for short-lived Spark applications or Lambda-based workloads where every second counts.
Avoiding S3 Throttling with ObjectStoreLocationProvider
S3 throttles requests at 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix. With a traditional Hive-style layout, all data files for a given partition share the same prefix:
s3://bucket/warehouse/db/events/event_date=2026-02-12/
00001-file-abc.parquet
00002-file-def.parquet
00003-file-ghi.parquet
...
If you are writing thousands of files per partition (common in streaming or high-volume batch), you will hit the prefix throttle.
The Solution: ObjectStoreLocationProvider
Iceberg's ObjectStoreLocationProvider generates a deterministic hash for each file and prepends it to the path, distributing files across many S3 prefixes automatically:
s3://bucket/warehouse/
a1b2c3d4/db/events/event_date=2026-02-12/00001-file-abc.parquet
e5f6g7h8/db/events/event_date=2026-02-12/00002-file-def.parquet
i9j0k1l2/db/events/event_date=2026-02-12/00003-file-ghi.parquet
Each file now has a unique prefix, spreading load across S3's distributed infrastructure.
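The mechanism can be illustrated in a few lines. This is a deliberately simplified sketch — Iceberg's ObjectStoreLocationProvider uses its own hash function and layout — but it shows the essential property: the prefix is a deterministic function of the file path, so writers need no coordination and readers need no extra lookups:

```python
# Simplified hash-prefix distribution: derive a short deterministic prefix
# from the relative file path and splice it in under the warehouse root.
import hashlib

def object_store_path(warehouse, relative_path):
    prefix = hashlib.md5(relative_path.encode()).hexdigest()[:8]
    return f"{warehouse}/{prefix}/{relative_path}"

p1 = object_store_path("s3://bucket/warehouse",
                       "db/events/event_date=2026-02-12/00001-file-abc.parquet")
p2 = object_store_path("s3://bucket/warehouse",
                       "db/events/event_date=2026-02-12/00002-file-def.parquet")
# Same input always yields the same path; different files land under
# different prefixes, spreading write load across S3's partitions.
```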
Enable it per table:
CREATE TABLE analytics.events (
event_id BIGINT,
event_time TIMESTAMP,
user_id STRING,
payload STRING
)
USING iceberg
PARTITIONED BY (day(event_time))
TBLPROPERTIES (
'write.object-storage.enabled' = 'true',
'write.data.path' = 's3://my-bucket/warehouse/db/events/data'
);
Or alter an existing table:
ALTER TABLE analytics.events SET TBLPROPERTIES (
'write.object-storage.enabled' = 'true',
'write.data.path' = 's3://my-bucket/warehouse/db/events/data'
);
Key points:
- write.object-storage.enabled activates the hash-prefix distribution.
- write.data.path specifies the root location under which hashed paths are generated. If not set, the table location is used.
- This does not affect query performance — Iceberg reads from manifest files which contain the full path to every data file, so the hashed prefixes are transparent to readers.
- This does not affect partition pruning — pruning happens at the metadata layer, not the filesystem layer.
When to use it: Always enable it for tables that write more than 1,000 files per hour. For low-volume tables (a few files per day), it is unnecessary but harmless.
Glue Catalog Configuration
AWS Glue Data Catalog is the most common Iceberg catalog on AWS. Here is how to configure it correctly.
Basic Setup
# Use Glue as the Iceberg catalog
spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue_catalog.warehouse=s3://my-bucket/warehouse/
spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
Note: GlueCatalog uses S3FileIO by default, so io-impl is optional here — but it is good practice to be explicit.
Optimistic Locking
Glue 4.0+ uses optimistic locking by default to guarantee atomic commits. When a Spark job commits a new snapshot, Glue checks that the table's version ID has not changed since the job read the metadata. If another job committed in between, the commit fails and Iceberg retries.
This replaces the older DynamoDB lock manager approach. If you are on Glue 3.0 or using Iceberg < 1.0, you need the DynamoDB lock:
# Only needed for Glue 3.0 / Iceberg < 1.0
spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager
spark.sql.catalog.glue_catalog.lock.table=iceberg_locks
For Glue 4.0+, no lock configuration is needed — optimistic locking is automatic. Just ensure your AWS SDK version is >= 2.17.131.
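The optimistic-locking model is a compare-and-swap on the table's version ID. A minimal sketch of the mechanism (the Catalog class here is illustrative, not Glue's API):

```python
# Optimistic locking: a commit succeeds only if the version id has not moved
# since the committer read the metadata; otherwise it fails and must re-read
# the latest metadata and retry (which Iceberg does automatically).

class Catalog:
    def __init__(self):
        self.version = 0
        self.metadata = "snapshot-0"

    def commit(self, expected_version, new_metadata):
        if self.version != expected_version:  # someone committed in between
            return False
        self.version += 1
        self.metadata = new_metadata
        return True

cat = Catalog()
v = cat.version
cat.commit(v, "snapshot-by-job-B")       # a concurrent job wins the race
ok = cat.commit(v, "snapshot-by-job-A")  # our stale commit is rejected
# Iceberg re-reads the current metadata and retries on the new version:
ok_retry = cat.commit(cat.version, "snapshot-by-job-A")
```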
Skip Archive (Critical for Streaming)
# Skip archiving old Glue table versions. Default: true (skipped).
spark.sql.catalog.glue_catalog.glue.skip-archive=true
When glue.skip-archive is false, every Iceberg commit creates a new Glue table version, and older versions are archived. For streaming ingestion that commits every minute, this generates thousands of Glue table versions per day, increasing Glue API costs and slowing down GetTable calls. Keep this at true (the default) unless you have a specific compliance reason to archive Glue versions.
Skip Name Validation
# Skip Glue naming validation (allow non-Hive-compatible names)
spark.sql.catalog.glue_catalog.glue.skip-name-validation=true
Glue enforces Hive-compatible naming rules by default (lowercase, no special characters). Set this to true if your existing naming conventions use characters that Glue would reject. Otherwise, leave it at the default.
Glue Catalog ID (Cross-Account)
# Use a specific AWS account's Glue catalog (for cross-account access)
spark.sql.catalog.glue_catalog.glue.id=123456789012
This is essential for cross-account data mesh architectures where a central account hosts the Glue catalog and multiple producer accounts write to it.
Server-Side Encryption
S3FileIO supports all three S3 server-side encryption modes. Configure encryption at the catalog level so every table inherits it automatically:
SSE-S3 (Amazon-Managed Keys)
# Simplest option: Amazon manages the keys
s3.sse.type=s3
Zero configuration beyond this. Amazon handles key rotation automatically. This is the right choice for most teams.
SSE-KMS (AWS KMS-Managed Keys)
# Use a specific KMS key for all Iceberg writes
s3.sse.type=kms
s3.sse.key=arn:aws:kms:us-east-1:123456789012:key/your-key-id
Use SSE-KMS when you need:
- Customer-managed key rotation policies.
- CloudTrail audit logging of every key usage.
- Cross-account access control via KMS key policies.
Cost note: Each S3 PUT triggers a KMS API call. For high-volume tables writing thousands of files per hour, this can add significant KMS costs (~$0.03 per 10,000 requests). Consider using S3 Bucket Keys (enabled at the bucket level) to reduce KMS calls by up to 99%.
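A back-of-envelope calculation makes the Bucket Keys argument concrete, using the ~$0.03 per 10,000 requests figure above (actual pricing varies by region; the workload numbers are illustrative):

```python
# Monthly KMS request cost for SSE-KMS writes, with and without S3 Bucket
# Keys (which can eliminate up to 99% of KMS calls).

def monthly_kms_cost(puts_per_hour, price_per_10k=0.03, bucket_key_reduction=0.0):
    requests = puts_per_hour * 24 * 30 * (1 - bucket_key_reduction)
    return requests / 10_000 * price_per_10k

print(round(monthly_kms_cost(5_000), 2))                             # 10.8
print(round(monthly_kms_cost(5_000, bucket_key_reduction=0.99), 2))  # 0.11
```

At 5,000 files per hour the KMS line item is modest, but it scales linearly with write volume — streaming workloads writing small files can reach millions of PUTs per day.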
DSSE-KMS (Dual-Layer Encryption)
# Dual-layer server-side encryption (FIPS compliance)
s3.sse.type=dsse-kms
s3.sse.key=arn:aws:kms:us-east-1:123456789012:key/your-key-id
DSSE-KMS applies two independent layers of encryption with two keys. It exists for government and financial compliance frameworks that mandate dual-layer encryption at rest.
SSE-C (Customer-Provided Keys)
# You provide the encryption key directly
s3.sse.type=custom
s3.sse.key=<base64-encoded-AES-256-key>
s3.sse.md5=<base64-encoded-MD5-digest-of-key>
SSE-C means you manage the key yourself. If you lose the key, the data is unrecoverable. Use this only if regulatory requirements mandate that AWS never stores your encryption key.
Credential Vending and Remote Signing
For multi-tenant architectures where different users or teams should only access specific tables, Iceberg supports two approaches through the REST catalog protocol.
Credential Vending
With credential vending, the catalog issues short-lived, scoped AWS credentials for each table access:
1. A query engine requests access to analytics.events.
2. The REST catalog generates temporary STS credentials scoped only to that table's S3 prefix.
3. The engine uses those credentials to read/write data files.
4. Credentials expire after minutes, enforcing the principle of least privilege.
This is supported by catalogs like Polaris, Snowflake, Dremio, and Nessie when running in REST mode. No special S3FileIO configuration is needed — the catalog pushes the credential configuration to the client automatically.
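The scoping in step 2 above boils down to an inline session policy attached to the STS call. A sketch of what such a policy might look like — the bucket, prefix, and function name are illustrative, and real catalogs generate this server-side:

```python
# Build an IAM session policy that confines temporary credentials to a
# single table's S3 prefix: object access under the prefix, plus a
# prefix-conditioned ListBucket. Pass it as Policy=json.dumps(...) in
# sts.assume_role(...).

def scoped_policy(bucket, table_prefix):
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{table_prefix}/*",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{table_prefix}/*"}},
            },
        ],
    }

policy = scoped_policy("my-bucket", "warehouse/db/events")
```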
Remote Signing
An alternative to credential vending: instead of giving the client temporary credentials, the catalog signs each S3 request on the server side:
# Enable remote signing — the catalog signs all S3 requests
s3.remote-signing-enabled=true
With remote signing, the query engine never sees any AWS credentials. Every S3 GET, PUT, and DELETE request is sent to the catalog for signing before being forwarded to S3. This provides the strongest security posture because no storage credentials ever leave the catalog server.
Trade-off: Remote signing adds a network round-trip per S3 request, which increases latency. Use it for environments where security requirements outweigh performance (e.g., regulated industries, multi-tenant SaaS platforms).
S3 Access Points
For complex networking setups — VPC endpoints, cross-account access, or firewall rules per table — use S3 Access Points:
# Map a specific bucket to an S3 Access Point ARN
s3.access-points.my-bucket=arn:aws:s3:us-east-1:123456789012:accesspoint/my-access-point
S3FileIO will route all requests for s3://my-bucket/... through the specified Access Point. This lets you apply different network policies and IAM policies per table or per namespace without changing the table's S3 path.
Cross-Region Access
# Allow reading from Access Points in a different region than the S3 client
s3.use-arn-region-enabled=true
# Allow cross-region bucket access (S3 Multi-Region Access Points)
s3.cross-region-access-enabled=true
Enable these for disaster recovery or multi-region query architectures where your Spark cluster may be in a different region than your data.
Custom S3 Client Factory
For advanced use cases — custom retry policies, proxy configuration, request interceptors, or integration with corporate credential providers — you can provide a custom S3 client factory:
# Use a custom factory class for building the S3 client
s3.client.factory=com.mycompany.iceberg.CustomS3ClientFactory
Your factory must implement S3FileIOAwsClientFactory:
package com.mycompany.iceberg;
import org.apache.iceberg.aws.s3.S3FileIOAwsClientFactory;
import software.amazon.awssdk.services.s3.S3Client;
import java.util.Map;
public class CustomS3ClientFactory implements S3FileIOAwsClientFactory {
@Override
public S3Client s3() {
return S3Client.builder()
.overrideConfiguration(config -> config
.addExecutionInterceptor(new RequestLoggingInterceptor())
.retryPolicy(retryPolicy -> retryPolicy.numRetries(10))
)
.httpClientBuilder(/* custom HTTP client config */)
.build();
}
@Override
public void initialize(Map<String, String> properties) {
// Read custom properties from catalog config
}
}
Common use cases for a custom factory:
- Corporate HTTP proxies that require authentication.
- Custom metrics collection on S3 request latency and error rates.
- Integrating with a secrets manager for credential retrieval.
- Adding request tracing headers for observability.
Package your factory class into a JAR and add it to Spark's classpath:
spark-submit \
--jars /path/to/custom-s3-factory.jar \
--conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.my_catalog.s3.client.factory=com.mycompany.iceberg.CustomS3ClientFactory \
your-application.jar
Complete Spark Configuration Template
Here is a production-ready Spark configuration that puts everything together:
# ── Catalog: Glue ──
spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.iceberg.warehouse=s3://production-bucket/warehouse/
spark.sql.catalog.iceberg.io-impl=org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.iceberg.glue.skip-archive=true
# ── S3FileIO: Uploads ──
spark.sql.catalog.iceberg.s3.multipart.size=33554432
spark.sql.catalog.iceberg.s3.multipart.num-threads=4
spark.sql.catalog.iceberg.s3.checksum-enabled=true
spark.sql.catalog.iceberg.s3.preload-client-enabled=true
# ── S3FileIO: Deletes ──
spark.sql.catalog.iceberg.s3.delete.batch-size=250
spark.sql.catalog.iceberg.s3.delete.num-threads=4
# ── S3FileIO: Encryption ──
spark.sql.catalog.iceberg.s3.sse.type=s3
# ── S3FileIO: Tags (for cost allocation) ──
spark.sql.catalog.iceberg.s3.write.table-tag-enabled=true
spark.sql.catalog.iceberg.s3.write.namespace-tag-enabled=true
spark.sql.catalog.iceberg.s3.write.tags.environment=production
# ── S3FileIO: Storage class ──
spark.sql.catalog.iceberg.s3.write.storage-class=INTELLIGENT_TIERING
# ── S3FileIO: Retries ──
spark.sql.catalog.iceberg.s3.retry.num-retries=5
spark.sql.catalog.iceberg.s3.retry.min-wait-ms=100
spark.sql.catalog.iceberg.s3.retry.max-wait-ms=13000
Table-Level Overrides
Remember that ObjectStoreLocationProvider is set per table, not per catalog:
CREATE TABLE iceberg.analytics.high_volume_events (
event_id BIGINT,
event_time TIMESTAMP,
user_id STRING,
event_type STRING,
payload STRING
)
USING iceberg
PARTITIONED BY (day(event_time))
TBLPROPERTIES (
'write.object-storage.enabled' = 'true',
'write.data.path' = 's3://production-bucket/warehouse/analytics/high_volume_events/data',
'write.distribution-mode' = 'hash',
'write.parquet.compression-codec' = 'zstd'
);
S3FileIO Properties Quick Reference
Here is every S3FileIO property in one table for quick reference:
| Property | Default | Description |
|---|---|---|
| s3.endpoint | — | Custom S3 endpoint URL |
| s3.path-style-access | false | Use path-style access (for MinIO, LocalStack) |
| s3.access-key-id | — | Static access key (prefer IAM roles instead) |
| s3.secret-access-key | — | Static secret key (prefer IAM roles instead) |
| s3.session-token | — | Static session token |
| s3.multipart.size | 32 MB | Part size for multipart uploads |
| s3.multipart.num-threads | CPU count | Threads for uploading parts |
| s3.multipart.threshold-factor | 1.5 | Factor to switch from PUT to multipart |
| s3.staging-dir | System temp | Local staging directory for parts |
| s3.checksum-enabled | true | eTag validation on writes |
| s3.delete.batch-size | 250 | Files per bulk delete request |
| s3.delete.num-threads | CPU count | Threads for delete operations |
| s3.delete-enabled | true | Whether deletes are allowed |
| s3.delete.tags.* | — | Tags applied before soft-delete |
| s3.write.tags.* | — | Tags applied during writes |
| s3.write.table-tag-enabled | false | Auto-tag with table name |
| s3.write.namespace-tag-enabled | false | Auto-tag with namespace |
| s3.write.storage-class | — | S3 storage class for writes |
| s3.sse.type | none | Encryption: none, s3, kms, dsse-kms, custom |
| s3.sse.key | — | KMS key ARN or base64 AES key |
| s3.sse.md5 | — | MD5 digest for SSE-C |
| s3.acceleration-enabled | false | S3 Transfer Acceleration |
| s3.dualstack-enabled | false | IPv4/IPv6 dual-stack endpoints |
| s3.cross-region-access-enabled | false | Cross-region bucket access |
| s3.use-arn-region-enabled | false | Cross-region Access Point calls |
| s3.access-points.* | — | Bucket-to-Access-Point mapping |
| s3.remote-signing-enabled | false | Remote request signing |
| s3.preload-client-enabled | false | Initialize client eagerly |
| s3.acl | — | Canned ACL for writes |
| s3.client.factory | — | Custom S3 client factory class |
| s3.retry.num-retries | 5 | Max retries for S3 operations |
| s3.retry.min-wait-ms | 100 | Min backoff between retries |
| s3.retry.max-wait-ms | 13000 | Max backoff between retries |
| s3.access-grants.enabled | false | S3 Access Grants integration |
| s3.access-grants.fallback-to-iam | true | Fallback to IAM if Access Grants denies |
Production Checklist
Use this checklist when setting up Iceberg on AWS:
FileIO:
- Set io-impl to org.apache.iceberg.aws.s3.S3FileIO
- Use s3:// paths, not s3a://
- Enable s3.preload-client-enabled for short-lived applications
- Set s3.checksum-enabled=true (default)
Throttling prevention:
- Enable write.object-storage.enabled on high-volume tables
- Keep s3.delete.batch-size at 250 (default)
- Monitor S3 503 SlowDown errors in CloudWatch
Encryption:
- Set s3.sse.type to at least s3 for encryption at rest
- Use kms with S3 Bucket Keys for KMS cost optimization
- Audit encryption settings with S3 Storage Lens
Cost management:
- Enable s3.write.table-tag-enabled and s3.write.namespace-tag-enabled
- Set s3.write.storage-class=INTELLIGENT_TIERING for variable-access tables
- Use S3 Lifecycle rules to transition old snapshots to cheaper tiers
Catalog (Glue):
- Set glue.skip-archive=true for streaming tables
- Ensure AWS SDK >= 2.17.131 for optimistic locking
- Remove DynamoDB lock manager config if using Glue 4.0+
Security:
- Use IAM roles, not static credentials (s3.access-key-id)
- Evaluate credential vending for multi-tenant deployments
- Evaluate s3.remote-signing-enabled for regulated environments
- Review S3 Access Points for network segmentation
How Cazpian Handles This
On Cazpian, S3FileIO is the default — every Iceberg table uses it out of the box. Cazpian's managed Spark clusters come pre-configured with optimized S3FileIO settings: progressive multipart uploads, Glue catalog with optimistic locking, and SSE-S3 encryption enabled by default. ObjectStoreLocationProvider is automatically enabled for tables that exceed configurable write thresholds. You focus on your data — we handle the infrastructure.
What's Next
This post covered the storage and catalog layer. For related topics, see our other posts in this series:
- Iceberg Table Design: Properties, Partitioning, and Commit Best Practices — how to design table properties, partition specs, and commit settings.
- Iceberg Query Performance Tuning — partition pruning, bloom filters, and Spark read configs.
- Backup, Recovery, and Disaster Recovery — protecting your tables with S3 versioning, register_table, and cross-region DR.
- Writing Efficient MERGE INTO Queries — push-down predicates, COW vs MOR, and compaction after merges.
- Iceberg CDC Patterns and Best Practices — real-time CDC pipelines with Flink and Spark.