
Iceberg Table Design: Properties, Partitioning, and Commit Best Practices

· 26 min read
Cazpian Engineering
Platform Engineering Team


You have just migrated to Apache Iceberg — or you are about to create your first Iceberg table. You open the documentation and find dozens of table properties, multiple partition transforms, and configuration knobs that interact with each other in non-obvious ways. Where do you start? Which properties actually matter? How many buckets should you use? What happens when two jobs write to the same table at the same time?

This guide answers all of those questions. We will walk through every table property that matters for production Iceberg tables, explain how to design partition specs that balance read and write performance, cover commit conflict resolution, and give you concrete recommendations for both partitioned and non-partitioned tables.

The Properties That Matter

Iceberg tables are configured through table properties — key-value pairs set at table creation or altered afterward. There are dozens of properties, but a handful drive the majority of production behavior. We will organize them by category.


Write Properties

These properties control how Iceberg writes data files.

| Property | Default | What It Does |
| --- | --- | --- |
| write.target-file-size-bytes | 536870912 (512 MB) | Target size for each data file. Iceberg closes a file and starts a new one when this threshold is reached. |
| write.distribution-mode | none | How Spark distributes rows before writing. Options: none, hash, range. |
| write.parquet.compression-codec | gzip | Compression codec for Parquet files. Options: gzip, snappy, lz4, zstd, uncompressed. |
| write.parquet.row-group-size-bytes | 134217728 (128 MB) | Target size for Parquet row groups within a file. |
| write.parquet.page-size-bytes | 1048576 (1 MB) | Target page size within Parquet row groups. |
| write.spark.fanout.enabled | false | When true, Spark keeps one writer open per partition simultaneously. Avoids a shuffle but uses more memory. |
| write.object-storage.enabled | false | When true, Iceberg generates randomized file paths to avoid S3 throttling on prefix-heavy writes. |
| write.data.path | (table location) | Custom path for data files. Useful for separating data from metadata on different storage tiers. |

Key recommendations:

  • Set write.target-file-size-bytes to 256 MB (268435456) for most workloads. The 512 MB default is fine for large analytical tables, but 256 MB gives a better balance between file pruning granularity and read parallelism. Tables under 10 GB can go as low as 128 MB.

  • Set write.parquet.compression-codec to zstd. Zstandard consistently outperforms gzip in both compression ratio and speed. It is the industry standard for modern data lakehouses.

  • Always set write.distribution-mode explicitly. The none default means Spark writes data in whatever order it arrives — which can create many small files when writing to partitioned tables. Use hash for partitioned tables and none only for non-partitioned tables or single-partition appends.

  • Enable write.object-storage.enabled on S3. S3 can throttle requests when many files share a common prefix. This property randomizes paths to distribute load across S3 partitions.
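The four recommendations above boil down to a small property map. As a minimal sketch (the table name, helper name, and chosen values are illustrative, not anything Iceberg mandates), here is one way to render them into the ALTER TABLE statements used throughout this post:

```python
# Sketch: assemble the recommended write properties into an ALTER TABLE
# statement. Values mirror the recommendations above; adjust per workload.
RECOMMENDED_WRITE_PROPS = {
    "write.target-file-size-bytes": str(256 * 1024 * 1024),  # 256 MB
    "write.parquet.compression-codec": "zstd",
    "write.distribution-mode": "hash",       # hash for partitioned tables
    "write.object-storage.enabled": "true",  # for S3-backed tables
}

def alter_table_sql(table: str, props: dict) -> str:
    """Render a Spark SQL ALTER TABLE ... SET TBLPROPERTIES statement."""
    body = ",\n".join(f"  '{k}' = '{v}'" for k, v in sorted(props.items()))
    return f"ALTER TABLE {table}\nSET TBLPROPERTIES (\n{body}\n);"

print(alter_table_sql("analytics.events", RECOMMENDED_WRITE_PROPS))
```

Keeping the property map in code (rather than hand-written DDL per table) makes it easy to apply one baseline across many tables and override per workload.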

Partition Distribution Mode Deep Dive

The write.distribution-mode property deserves special attention because it is the single most impactful property for write correctness on partitioned tables.

| Mode | Behavior | Best For |
| --- | --- | --- |
| none | No shuffle. Each Spark task writes to whatever partitions its data belongs to. | Non-partitioned tables. Single-partition appends. |
| hash | Shuffle by partition key. All rows for a partition land on the same writer task. | Most partitioned tables. Batch ETL. |
| range | Sort by partition key plus optional sort columns. Rows arrive in sorted order. | Sorted tables. Range-partitioned tables. |

When write.distribution-mode is none on a partitioned table, each Spark task may write tiny files to many partitions. With 200 tasks writing to 365 daily partitions, a single job can produce up to 73,000 small files. Setting it to hash routes all rows for a given partition to a single writer task, so each partition receives a few well-sized files instead of hundreds of tiny ones.
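The small-file arithmetic can be sketched directly. This is a worst-case bound, not a simulation of Spark's writer, and the task/partition counts are the illustrative ones from the text:

```python
# Worst-case data-file count for a single write job, by distribution mode.
def worst_case_files(tasks: int, partitions: int, mode: str) -> int:
    if mode == "none":
        # every task may open a file in every partition its rows touch
        return tasks * partitions
    if mode == "hash":
        # each partition's rows are routed to one writer task
        return partitions
    raise ValueError(f"unknown distribution mode: {mode}")

print(worst_case_files(200, 365, "none"))  # 73000
print(worst_case_files(200, 365, "hash"))  # 365
```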

-- Recommended: hash distribution for partitioned tables
ALTER TABLE analytics.events
SET TBLPROPERTIES (
'write.distribution-mode' = 'hash'
);

For tables with a sort order defined, Spark automatically uses range distribution to honor the sort. You can also force it:

ALTER TABLE analytics.events
SET TBLPROPERTIES (
'write.distribution-mode' = 'range'
);

Commit Properties

These properties control how Iceberg handles concurrent writes and commit retries.

| Property | Default | What It Does |
| --- | --- | --- |
| commit.retry.num-retries | 4 | Number of times to retry a commit after a conflict. |
| commit.retry.min-wait-ms | 100 | Minimum wait time between retries (exponential backoff). |
| commit.retry.max-wait-ms | 60000 (60 s) | Maximum wait time between retries. |
| commit.manifest.target-size-bytes | 8388608 (8 MB) | Target size for manifest files during commit. |
| commit.manifest-merge.enabled | true | Whether to merge small manifest files during commit. |
| commit.status-check.min-wait-ms | 1000 | How long to wait before checking whether a commit with unknown status succeeded. |

Key recommendations:

  • Increase commit.retry.num-retries to 10 for high-concurrency tables. If multiple jobs write to the same table (for example, several streaming jobs appending to different partitions), the default of 4 retries may not be enough. Each retry re-reads the latest metadata and re-applies the change.

  • Leave commit.manifest-merge.enabled at true. This merges small manifest files during each commit, keeping the metadata tree compact. Disabling it can cause manifest file explosion over time.

-- High-concurrency table configuration
ALTER TABLE analytics.events
SET TBLPROPERTIES (
'commit.retry.num-retries' = '10',
'commit.retry.min-wait-ms' = '200',
'commit.retry.max-wait-ms' = '120000'
);
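The commit.retry.* semantics are easiest to see as the wait schedule they imply. A minimal sketch, assuming a simple doubling backoff capped at the maximum (Iceberg's actual implementation adds jitter and is not reproduced here):

```python
# Illustration of exponential backoff under the commit.retry.* properties.
# Wait before retry N is min_wait * 2^N, capped at max_wait. Jitter omitted
# for determinism; this is not Iceberg's code, just the shape of it.
def retry_waits_ms(num_retries: int, min_wait_ms: int, max_wait_ms: int) -> list:
    return [min(min_wait_ms * (2 ** attempt), max_wait_ms)
            for attempt in range(num_retries)]

# Defaults: 4 retries, 100 ms min wait, 60 s max wait
print(retry_waits_ms(4, 100, 60_000))    # [100, 200, 400, 800]
# High-concurrency config above: 10 retries, 200 ms min, 120 s max
print(retry_waits_ms(10, 200, 120_000))
```

The takeaway: with the defaults, a writer gives up after roughly 1.5 seconds of cumulative waiting, which is why busy multi-writer tables benefit from more retries and a larger max wait.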

Write Isolation and Conflict Resolution

Iceberg uses optimistic concurrency control (OCC). When two writers commit at the same time, one succeeds and the other must retry. Whether the retry succeeds depends on the isolation level and whether the writes conflict.

Iceberg supports two isolation levels:

| Isolation Level | Behavior | Set Via |
| --- | --- | --- |
| Serializable (default) | Strict. Conflicts if any concurrent operation could have affected the result. | Default for all operations. |
| Snapshot | Relaxed. Only conflicts if two operations modify the same files or partitions. | write.delete.isolation-level = snapshot |

When do conflicts happen?

  • Append + Append to different partitions: No conflict (both isolation levels). This is the most common case in streaming and partitioned ETL.
  • Append + Append to the same partition: No conflict in snapshot isolation. Conflicts in serializable only if a delete or overwrite is involved.
  • Overwrite + Overwrite on overlapping partitions: Always conflicts. The second writer must retry.
  • Append + Delete/Overwrite on overlapping data: Conflicts in serializable. No conflict in snapshot isolation if they touch different files.

For most data pipelines, the default serializable isolation with 4-10 retries works well. Only switch to snapshot isolation when you have a specific need (for example, a streaming append job running alongside a batch overwrite job that never touches the same partitions).

-- Snapshot isolation for delete operations (relaxed conflict detection)
ALTER TABLE analytics.events
SET TBLPROPERTIES (
'write.delete.isolation-level' = 'snapshot'
);

Copy-on-Write vs Merge-on-Read

Iceberg supports two strategies for handling updates and deletes:

| Property | Default | Options |
| --- | --- | --- |
| write.delete.mode | copy-on-write | copy-on-write, merge-on-read |
| write.update.mode | copy-on-write | copy-on-write, merge-on-read |
| write.merge.mode | copy-on-write | copy-on-write, merge-on-read |

Copy-on-Write (COW): When you delete or update a row, Iceberg rewrites the entire data file that contains that row, excluding the deleted/modified rows. Reads are fast because there are no delete files to reconcile. Writes are slower because entire files get rewritten.

Merge-on-Read (MOR): When you delete or update a row, Iceberg writes a small delete file that marks the row as deleted. Reads must merge data files with delete files at query time. Writes are fast. Reads are slower until compaction merges the delete files back into data files.
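The write-amplification gap between the two modes is large enough to estimate on the back of an envelope. A sketch, assuming (illustratively) 256 MB data files and ~1 KB positional delete files; this is the shape of the trade-off, not a benchmark:

```python
# Approximate bytes written for an update touching N data files,
# under copy-on-write vs merge-on-read. Sizes are illustrative assumptions.
DATA_FILE_BYTES = 256 * 1024 * 1024   # assumed data file size
DELETE_FILE_BYTES = 1024              # assumed positional delete file size

def bytes_written(mode: str, files_touched: int) -> int:
    if mode == "copy-on-write":
        # every touched data file is rewritten in full
        return files_touched * DATA_FILE_BYTES
    if mode == "merge-on-read":
        # one small delete file per touched data file
        return files_touched * DELETE_FILE_BYTES
    raise ValueError(mode)

# Updating rows scattered across 100 files:
print(bytes_written("copy-on-write", 100) // (1024 * 1024), "MB")  # 25600 MB
print(bytes_written("merge-on-read", 100) // 1024, "KB")           # 100 KB
```

The flip side, covered in the table below, is that every one of those delete files must be reconciled at read time until compaction runs.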

| Scenario | Recommended Mode | Why |
| --- | --- | --- |
| Infrequent deletes/updates, heavy reads | Copy-on-Write | Reads are dominant — keep them fast. |
| Frequent deletes/updates, fewer reads | Merge-on-Read | Avoid constant file rewrites. |
| CDC / streaming upserts | Merge-on-Read | Write throughput matters more. Compact periodically. |
| Regulatory deletes (GDPR) on large tables | Merge-on-Read | Delete a few rows without rewriting terabytes. |

-- Merge-on-Read for tables with frequent updates
ALTER TABLE analytics.user_profiles
SET TBLPROPERTIES (
'write.delete.mode' = 'merge-on-read',
'write.update.mode' = 'merge-on-read',
'write.merge.mode' = 'merge-on-read'
);

Metadata Properties

These properties control how Iceberg manages its own metadata files.

| Property | Default | What It Does |
| --- | --- | --- |
| write.metadata.previous-versions-max | 100 | Number of previous metadata files to keep. Older files are deleted during commits only when delete-after-commit is enabled. |
| write.metadata.delete-after-commit.enabled | false | When true, metadata files beyond the previous-versions-max threshold are automatically deleted after each commit. |
| write.metadata.metrics.default | truncate(16) | Default column statistics mode for all columns. |
| write.metadata.metrics.column.* | (inherits default) | Per-column override for the statistics mode. |

Key recommendations:

  • Set write.metadata.previous-versions-max to 10-20 for high-frequency write tables. Streaming tables that commit every minute will accumulate thousands of metadata files if you keep the default of 100. Reducing this keeps the metadata directory clean.

  • Enable write.metadata.delete-after-commit.enabled if your table has frequent commits. This automatically cleans up metadata files that exceed the previous-versions-max threshold, preventing metadata file bloat.

-- Metadata management for streaming tables
ALTER TABLE analytics.events
SET TBLPROPERTIES (
'write.metadata.previous-versions-max' = '15',
'write.metadata.delete-after-commit.enabled' = 'true'
);

We will cover write.metadata.metrics.* in detail in the next blog post on query performance tuning, where we discuss how column statistics drive file pruning.

Designing Partition Specs

Partitioning is the single most important design decision for an Iceberg table. A good partition spec can make queries 10-100x faster. A bad one can make them slower than no partitioning at all.

Iceberg Partition Transforms

Unlike Hive, Iceberg does not require you to create a separate column for partitioning. Instead, you apply a transform to an existing column. This is called hidden partitioning — the partition structure is invisible to the query writer.

| Transform | Syntax | What It Produces | Example |
| --- | --- | --- | --- |
| Identity | PARTITIONED BY (region) | Exact column value | region=us-east-1 |
| Year | PARTITIONED BY (year(event_time)) | Year extracted from the timestamp | event_time_year=2026 |
| Month | PARTITIONED BY (month(event_time)) | Year + month | event_time_month=2026-02 |
| Day | PARTITIONED BY (day(event_time)) | Year + month + day | event_time_day=2026-02-12 |
| Hour | PARTITIONED BY (hour(event_time)) | Year + month + day + hour | event_time_hour=2026-02-12-14 |
| Bucket | PARTITIONED BY (bucket(16, user_id)) | Hash mod N | user_id_bucket=7 |
| Truncate | PARTITIONED BY (truncate(4, zip_code)) | First N characters (strings) or nearest lower multiple of N (numbers) | zip_code_trunc=9021 |

Choosing the Right Transform

Time-based transforms (year, month, day, hour) are the most common. The choice depends on data volume per time unit:

| Daily Data Volume | Recommended Transform | Resulting Partition Size |
| --- | --- | --- |
| < 10 MB/day | month or no partitioning | 300 MB - 3 GB per partition |
| 10 MB - 1 GB/day | day | 10 MB - 1 GB per partition |
| 1 GB - 100 GB/day | day | 1 GB - 100 GB per partition |
| > 100 GB/day | day + bucket compound partition | Distributed within each day |
| > 1 TB/hour | hour | Manageable per-partition sizes |

The target: each partition should contain 128 MB to 1 GB of data after compression. This gives Iceberg enough data to write well-sized files (256 MB each) while keeping the partition count manageable.

Bucket transforms distribute data by hashing a column value into N buckets. They are ideal for high-cardinality columns that do not have a natural time dimension:

-- Bucket partitioning for a user-keyed table
CREATE TABLE analytics.user_events (
user_id BIGINT,
event_type STRING,
event_time TIMESTAMP,
payload STRING
)
USING iceberg
PARTITIONED BY (bucket(16, user_id), day(event_time))
TBLPROPERTIES (
'write.distribution-mode' = 'hash',
'write.target-file-size-bytes' = '268435456'
);

How Many Buckets?

This is the most common question teams ask about bucket partitioning. The answer comes from simple math:

Buckets = Total daily data volume / Target partition size

Example:
50 GB/day with day() partitioning = 50 GB per daily partition
Target partition size = 1 GB
Buckets = 50 / 1 = 50 → round up to the next power of two: bucket(64, user_id)
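The same calculation can be captured in a few lines. A sketch of the rule above (the function name and the 1 GB default target are illustrative):

```python
import math

# Bucket count = daily volume / target partition size, rounded up to the
# next power of two. Returns 0 when bucketing is unnecessary.
def bucket_count(daily_gb: float, target_partition_gb: float = 1.0) -> int:
    raw = daily_gb / target_partition_gb
    if raw <= 1:
        return 0  # a day partition alone is sufficient; skip the bucket transform
    return 2 ** math.ceil(math.log2(raw))

print(bucket_count(50))   # 64
print(bucket_count(0.5))  # 0
print(bucket_count(300))  # 512
```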

Rules of thumb:

| Daily Data Volume (per day partition) | Recommended Buckets |
| --- | --- |
| < 1 GB | Do not use bucket. A day partition alone is sufficient. |
| 1 GB - 10 GB | bucket(8, col) or bucket(16, col) |
| 10 GB - 100 GB | bucket(32, col) or bucket(64, col) |
| 100 GB - 1 TB | bucket(128, col) or bucket(256, col) |
| > 1 TB | bucket(512, col) or bucket(1024, col) |

Important: Choose bucket counts that are powers of 2. This aligns with hash distribution and makes future partition evolution (doubling or halving buckets) cleaner.

Warning: Over-bucketing is worse than under-bucketing. If you create bucket(1024, user_id) but only have 10 GB of daily data, each bucket gets ~10 MB — well below the target file size. This creates thousands of undersized files per day.

Compound Partition Specs

Real-world tables often need compound partition specs combining time and bucket transforms:

-- High-volume event table: day + bucket
CREATE TABLE analytics.clickstream (
session_id STRING,
user_id BIGINT,
page_url STRING,
event_time TIMESTAMP,
properties MAP<STRING, STRING>
)
USING iceberg
PARTITIONED BY (day(event_time), bucket(32, user_id))
TBLPROPERTIES (
'write.distribution-mode' = 'hash',
'write.target-file-size-bytes' = '268435456',
'write.parquet.compression-codec' = 'zstd'
);

The order of transforms in the partition spec matters for query performance. Put the most commonly filtered column first. If most queries filter by event_time, put day(event_time) before bucket(user_id).

The Truncate Transform

The truncate transform is useful for partitioning on string or numeric columns where you want to group similar values:

-- Group by zip code prefix (first 3 digits)
CREATE TABLE geo.locations (
location_id BIGINT,
zip_code STRING,
city STRING,
state STRING
)
USING iceberg
PARTITIONED BY (truncate(3, zip_code));

truncate(3, '90210') produces '902', grouping all zip codes starting with 902 into the same partition.

For numeric columns, truncate(N, col) rounds down to the nearest multiple of N:

  • truncate(1000, 54321) = 54000
  • truncate(100, 54321) = 54300
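Both behaviors can be mimicked in a few lines, matching the examples above. This is a sketch of the transform's semantics for strings and positive integers, not Iceberg's implementation:

```python
# Truncate transform semantics: prefix for strings, floor-to-multiple for ints.
def truncate_string(width: int, value: str) -> str:
    return value[:width]

def truncate_int(width: int, value: int) -> int:
    # round down to the nearest lower multiple of width
    return value - (value % width)

print(truncate_string(3, "90210"))  # '902'
print(truncate_int(1000, 54321))    # 54000
print(truncate_int(100, 54321))     # 54300
```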

Non-Partitioned Tables: When and How

Not every table needs partitioning. Small tables and tables without a natural partition key can perform better without partitioning.

When to Skip Partitioning

  • Table size under 5 GB. Full table scans are fast at this scale. Partitioning adds metadata overhead without significant pruning benefit.
  • No natural filter column. If queries do not consistently filter on any particular column, partitioning creates overhead without pruning benefit.
  • Very wide tables with few rows. Tables with thousands of columns but relatively few rows (like configuration or dimension tables) do not benefit from partitioning.

Best Practices for Non-Partitioned Tables

-- Non-partitioned table with proper write configuration
CREATE TABLE analytics.dim_products (
product_id BIGINT,
product_name STRING,
category STRING,
subcategory STRING,
price DECIMAL(10,2),
created_at TIMESTAMP,
updated_at TIMESTAMP
)
USING iceberg
TBLPROPERTIES (
'write.distribution-mode' = 'none',
'write.target-file-size-bytes' = '134217728',
'write.parquet.compression-codec' = 'zstd',
'format-version' = '2'
);

For non-partitioned tables:

  • Set write.distribution-mode to none. There are no partitions to hash by, so any shuffle is wasted.
  • Use a smaller target file size (128 MB) since the entire table is read on any query. Smaller files mean better parallelism.
  • Consider adding a sort order to improve file pruning through min/max statistics.

Adding Sort Order for Better Pruning

Even without partitioning, you can dramatically improve query performance by sorting data within files. When files are sorted, the min/max column statistics in manifests become much more effective for file pruning:

-- Sort order on a non-partitioned table
ALTER TABLE analytics.dim_products
WRITE ORDERED BY (category, subcategory);

After setting a sort order, run compaction to physically rewrite the data in sorted order:

CALL system.rewrite_data_files(
table => 'analytics.dim_products',
strategy => 'sort'
);

We will cover how sort order interacts with file pruning and min/max statistics in detail in our next blog post on Iceberg Query Performance Tuning.

Table Sizing Guidelines

Getting table and partition sizing right prevents the two most common Iceberg performance problems: too many small files and too few large files.

Target Sizes

| Component | Recommended Size | Why |
| --- | --- | --- |
| Data file | 128 MB - 512 MB | Balances S3 GET cost, parallelism, and file pruning granularity. |
| Partition | 128 MB - 1 GB (compressed) | Produces 1-4 well-sized data files per partition. |
| Row group | 64 MB - 128 MB | Matches Parquet reader defaults. Allows efficient column projection. |
| Manifest file | 8 MB (default) | Keeps planning time fast. One manifest should track thousands of data files. |
| Metadata file | Keep under 50 MB | Large metadata files slow down query planning. |

The Math Behind Partition Sizing

Here is a formula to validate your partition spec design:

Files per partition = Partition data volume / write.target-file-size-bytes

Example:
Partition volume: 768 MB (compressed)
Target file size: 256 MB
Files per partition: 768 / 256 = 3 files ✓ (Good)

Bad example:
Partition volume: 5 MB (compressed)
Target file size: 256 MB
Files per partition: 5 / 256 = 0.02 → 1 tiny file ✗ (Undersized)

If the math shows that most partitions produce less than one full-sized file, your partitions are too granular. Either use a coarser time transform (month instead of day) or reduce the number of buckets.
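The validation above fits naturally in a helper you can run against the per-partition volumes in your own tables. A sketch, with the 256 MB target and the verdict wording as illustrative choices:

```python
# Validate a partition spec: a partition should fill at least one
# target-sized data file, otherwise it is too granular.
def files_per_partition(partition_mb: float, target_file_mb: float = 256) -> float:
    return partition_mb / target_file_mb

def partition_verdict(partition_mb: float, target_file_mb: float = 256) -> str:
    if files_per_partition(partition_mb, target_file_mb) < 1:
        return "undersized: use a coarser time transform or fewer buckets"
    return "ok"

print(files_per_partition(768))  # 3.0
print(partition_verdict(768))    # ok
print(partition_verdict(5))      # undersized: ...
```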

Monitoring Table Health

Use Iceberg metadata tables to check if your sizing is healthy:

-- Check file size distribution
SELECT
partition,
COUNT(*) AS file_count,
ROUND(AVG(file_size_in_bytes) / 1048576, 1) AS avg_file_mb,
ROUND(MIN(file_size_in_bytes) / 1048576, 1) AS min_file_mb,
ROUND(MAX(file_size_in_bytes) / 1048576, 1) AS max_file_mb,
SUM(record_count) AS total_records
FROM analytics.events.files
GROUP BY partition
ORDER BY avg_file_mb ASC
LIMIT 20;
-- Check manifest health
SELECT
COUNT(*) AS manifest_count,
SUM(added_data_files_count + existing_data_files_count) AS total_file_entries,
ROUND(AVG(length) / 1048576, 2) AS avg_manifest_mb
FROM analytics.events.manifests;

If avg_file_mb is consistently below 100 MB, you likely need to adjust your distribution mode, reduce bucket count, or use a coarser time transform. If manifest count is growing beyond a few hundred, ensure commit.manifest-merge.enabled is true.

Complete Table Creation Templates

Here are production-ready templates for the most common table patterns. Each template includes the scenario it solves, why the properties were chosen, and the trade-offs involved.

Template 1: High-Volume Event Table (Partitioned by Day + Bucket)

The scenario: Your product team collects clickstream, page views, and user interaction events. The table receives 20-80 GB of data per day from multiple Spark Structured Streaming jobs running concurrently. Dashboards query the last 7 days by event_time, and data scientists frequently filter by user_id to analyze individual user journeys.

Why this design:

  • day(event_time) partitioning because most queries filter by date range. Daily partitions at 20-80 GB are well-sized — large enough for multiple 256 MB files per partition.
  • bucket(32, user_id) because 20-80 GB per day is too large for a single daily partition to serve user-level lookups efficiently. With 32 buckets, each bucket holds 625 MB - 2.5 GB per day — the sweet spot for file pruning on user_id.
  • hash distribution because multiple streaming jobs write concurrently. Without hash distribution, each streaming micro-batch would scatter tiny files across all 32 buckets in each daily partition.
  • commit.retry.num-retries = 10 because concurrent streaming writers will occasionally conflict on commit. Ten retries with exponential backoff handles this gracefully.
  • Metadata auto-cleanup because streaming jobs commit every 30-60 seconds, generating thousands of metadata files per day. Without cleanup, the metadata directory bloats and slows query planning.
  • write.object-storage.enabled because high-throughput writes to S3 can trigger 503 SlowDown errors when many files share a common prefix. Randomized paths distribute load across S3 partitions.
CREATE TABLE analytics.events (
event_id STRING,
user_id BIGINT,
event_type STRING,
event_time TIMESTAMP,
properties MAP<STRING, STRING>,
session_id STRING,
device_type STRING,
country STRING
)
USING iceberg
PARTITIONED BY (day(event_time), bucket(32, user_id))
TBLPROPERTIES (
'format-version' = '2',
'write.distribution-mode' = 'hash',
'write.target-file-size-bytes' = '268435456',
'write.parquet.compression-codec' = 'zstd',
'write.parquet.row-group-size-bytes' = '67108864',
'write.object-storage.enabled' = 'true',
'write.metadata.previous-versions-max' = '20',
'write.metadata.delete-after-commit.enabled' = 'true',
'commit.retry.num-retries' = '10',
'commit.manifest-merge.enabled' = 'true'
);

Post-creation maintenance: Schedule daily rewrite_data_files with binpack strategy during off-peak hours to merge the small files that streaming micro-batches inevitably create. Run expire_snapshots weekly to clean up the high snapshot count.


Template 2: Daily Batch Table (Partitioned by Day)

The scenario: Your ETL pipeline runs once per day (typically via Airflow or a scheduled Spark job). It computes daily aggregated metrics — revenue by region, active users by segment, conversion rates by funnel stage. Each day's batch is a clean INSERT OVERWRITE that replaces the previous day's partition. The table is read by BI dashboards (Tableau, Looker, Superset) that query the last 30-90 days.

Why this design:

  • day(report_date) partitioning is the natural choice because data arrives daily and dashboards always filter by date range. Each daily partition is typically 50-500 MB — well within the recommended range.
  • hash distribution because even though there is only one write job per day, that job may produce data for multiple days (backfills, late-arriving corrections). Hash ensures each daily partition gets well-sized files.
  • No bucket partitioning because the daily data volume (50-500 MB) is small enough that a single daily partition works perfectly. Adding buckets would over-partition and create undersized files.
  • No metadata auto-cleanup because with only one commit per day, metadata growth is negligible. The default 100 previous versions is more than sufficient.
  • No elevated commit retries because there is only one writer. No concurrent write conflicts are possible.
CREATE TABLE analytics.daily_aggregates (
report_date DATE,
metric_name STRING,
dimension STRING,
value DOUBLE,
row_count BIGINT,
updated_at TIMESTAMP
)
USING iceberg
PARTITIONED BY (day(report_date))
TBLPROPERTIES (
'format-version' = '2',
'write.distribution-mode' = 'hash',
'write.target-file-size-bytes' = '268435456',
'write.parquet.compression-codec' = 'zstd'
);

Post-creation maintenance: Minimal. Run expire_snapshots monthly to clean up old snapshots. Compaction is rarely needed since each batch write produces well-sized files.


Template 3: Slowly Changing Dimension (Non-Partitioned, Sorted)

The scenario: Your customer dimension table holds 2-5 million rows (under 2 GB). It is updated throughout the day as customers change their profiles, upgrade plans, or update addresses. Every analytical query in your warehouse joins to this table — by customer_id for lookups, or by segment for aggregations. The table is small but queried thousands of times per day.

Why this design:

  • No partitioning because the table is under 5 GB. Partitioning a 2 GB table would create metadata overhead without meaningful pruning benefit. A full scan of 2 GB takes under 2 seconds on modern hardware.
  • write.distribution-mode = none because there are no partitions to distribute across. Any shuffle would be wasted compute.
  • 128 MB target file size (smaller than the 256 MB default) because the entire table fits in a few files. Smaller files create more Spark tasks, improving read parallelism for the many concurrent queries hitting this table.
  • Merge-on-read for deletes and updates because this table receives frequent small updates (individual customer changes). Copy-on-write would rewrite an entire 256 MB file to change one row. Merge-on-read writes a tiny delete file instead — much faster for frequent, small mutations.
  • Sort order on (segment, customer_id) because the most common query patterns are WHERE segment = 'enterprise' (aggregations) and WHERE customer_id = 12345 (lookups). Sorting makes min/max statistics effective for file pruning even without partitioning.
CREATE TABLE analytics.dim_customers (
customer_id BIGINT,
full_name STRING,
email STRING,
segment STRING,
lifetime_value DECIMAL(12,2),
created_at TIMESTAMP,
updated_at TIMESTAMP
)
USING iceberg
TBLPROPERTIES (
'format-version' = '2',
'write.distribution-mode' = 'none',
'write.target-file-size-bytes' = '134217728',
'write.parquet.compression-codec' = 'zstd',
'write.delete.mode' = 'merge-on-read',
'write.update.mode' = 'merge-on-read'
);

-- Add sort order for query performance
ALTER TABLE analytics.dim_customers
WRITE ORDERED BY (segment, customer_id);

Post-creation maintenance: Schedule weekly compaction with sort strategy to merge accumulated delete files back into data files and restore sorted order. Without periodic compaction, merge-on-read tables accumulate delete files that progressively slow down reads.


Template 4: CDC / Streaming Upsert Table

The scenario: You are replicating an operational PostgreSQL or MySQL database into your lakehouse using a CDC tool (Debezium, AWS DMS, or Airbyte). The CDC stream delivers inserts, updates, and deletes for a user_profiles table. Your Spark Structured Streaming job consumes these CDC events and applies them as MERGE operations against the Iceberg table. The table receives thousands of upserts per minute across millions of users. Downstream consumers include recommendation engines and fraud detection systems that need near-real-time data.

Why this design:

  • bucket(16, user_id) partitioning because CDC upserts are distributed across all users — there is no natural time-based partitioning. Bucketing by user_id ensures that MERGE operations can efficiently locate the existing row to update. With 16 buckets, each MERGE only scans 1/16th of the table to find matching rows.
  • hash distribution to align the write shuffle with the bucket partition spec. This ensures each Spark task writes to exactly one bucket, producing well-sized files.
  • Merge-on-read for all operations because CDC workloads are update-heavy. A single CDC batch might update 10,000 rows spread across 5,000 files. Copy-on-write would rewrite all 5,000 files. Merge-on-read writes 5,000 tiny delete files instead — orders of magnitude faster for the write path.
  • Metadata auto-cleanup because streaming CDC commits happen every 30-60 seconds, generating thousands of metadata versions per day.
  • Elevated commit retries because multiple CDC streams (from different source tables) may share the same Iceberg catalog, and the streaming commit interval means frequent commit attempts.
CREATE TABLE analytics.user_profiles (
user_id BIGINT,
username STRING,
email STRING,
profile_data STRING,
last_login TIMESTAMP,
updated_at TIMESTAMP
)
USING iceberg
PARTITIONED BY (bucket(16, user_id))
TBLPROPERTIES (
'format-version' = '2',
'write.distribution-mode' = 'hash',
'write.target-file-size-bytes' = '268435456',
'write.parquet.compression-codec' = 'zstd',
'write.delete.mode' = 'merge-on-read',
'write.update.mode' = 'merge-on-read',
'write.merge.mode' = 'merge-on-read',
'write.metadata.previous-versions-max' = '15',
'write.metadata.delete-after-commit.enabled' = 'true',
'commit.retry.num-retries' = '10'
);

Post-creation maintenance: This is the most maintenance-intensive pattern. Schedule compaction with binpack strategy every 4-6 hours to merge delete files back into data files. Without frequent compaction, merge-on-read tables degrade rapidly — a table with 100,000 accumulated delete files can be 10x slower to read than a freshly compacted one. Also run expire_snapshots daily to clean up the high snapshot count from streaming commits.


Template 5: Large Analytical Table (Month Partitioned)

The scenario: Your finance team maintains a transactions table spanning 3-5 years of history. The table is append-only — no updates or deletes (regulatory requirement to preserve every record). Total size is 5-20 TB. A single batch job appends 200-500 MB of new transactions per day. Analysts query this table for quarterly and annual reports, always filtering by time range. The table also supports audit queries that may scan an entire quarter.

Why this design:

  • month(created_at) partitioning instead of day because 200-500 MB per day is below the ideal partition size. Monthly partitions at 6-15 GB each are well within the sweet spot. Daily partitioning would create 365 partitions per year with undersized files; monthly creates 12 partitions per year with optimal file sizes.
  • 512 MB target file size (larger than the 256 MB default) because the table is massive (5-20 TB) and queries scan entire months or quarters. Larger files reduce the total file count and manifest size, speeding up query planning. A 15 GB monthly partition with 512 MB files produces ~30 files — manageable and efficient.
  • No bucket partitioning because analytical queries do not filter by account_id or transaction_id consistently enough to justify the extra partitioning dimension. The primary access pattern is time-range scans.
  • write.object-storage.enabled because with terabytes of data, S3 prefix throttling becomes a real concern during large scans or compaction operations.
  • No merge-on-read because the table is append-only. There are no deletes or updates, so copy-on-write (the default) is the correct choice — it keeps reads as fast as possible with no delete files to reconcile.
  • No metadata auto-cleanup or elevated retries because there is only one writer per day, and commit frequency is low.
CREATE TABLE analytics.transactions (
transaction_id STRING,
account_id BIGINT,
amount DECIMAL(15,2),
currency STRING,
transaction_type STRING,
status STRING,
created_at TIMESTAMP
)
USING iceberg
PARTITIONED BY (month(created_at))
TBLPROPERTIES (
'format-version' = '2',
'write.distribution-mode' = 'hash',
'write.target-file-size-bytes' = '536870912',
'write.parquet.compression-codec' = 'zstd',
'write.object-storage.enabled' = 'true'
);

Post-creation maintenance: Run expire_snapshots monthly with a retention of 7 days (since the table is append-only, old snapshots are rarely needed for rollback). Run rewrite_manifests quarterly to keep the manifest tree compact as the table grows over years.


Choosing the Right Template

| Your Scenario | Template | Key Decision Factors |
| --- | --- | --- |
| Streaming events, multiple writers, user-level queries | Template 1 | High volume + concurrent writes + user-level filters = day + bucket + metadata cleanup |
| Daily batch ETL, dashboard-facing, one writer | Template 2 | Low-frequency writes + time-range queries = simple day partition, minimal config |
| Small dimension table, frequent updates, join target | Template 3 | Small size + frequent mutations + lookup queries = no partition + sort + MOR |
| CDC replication, streaming upserts, near-real-time | Template 4 | Update-heavy + streaming + key-based lookups = bucket + MOR + aggressive maintenance |
| Multi-year history, append-only, large scans | Template 5 | Massive size + append-only + quarterly scans = month partition + large files + COW |

Quick Reference: Property Cheat Sheet

| Goal | Property | Recommended Value |
| --- | --- | --- |
| Control file size | write.target-file-size-bytes | 268435456 (256 MB) |
| Prevent small files | write.distribution-mode | hash (partitioned) / none (non-partitioned) |
| Modern compression | write.parquet.compression-codec | zstd |
| S3 performance | write.object-storage.enabled | true |
| Handle concurrent writes | commit.retry.num-retries | 10 |
| Keep metadata compact | write.metadata.previous-versions-max | 15-20 |
| Auto-clean metadata | write.metadata.delete-after-commit.enabled | true |
| Merge manifests | commit.manifest-merge.enabled | true |
| CDC / frequent updates | write.delete.mode | merge-on-read |
| Batch / rare updates | write.delete.mode | copy-on-write |
| Format version | format-version | 2 (required for MOR, row-level deletes) |

Common Mistakes

  1. Using write.distribution-mode = none on partitioned tables. This is the number one cause of small file problems. Always use hash or range for partitioned tables.

  2. Over-bucketing. Creating bucket(1024, col) when you only have 5 GB of data per day. Start with fewer buckets and increase later — Iceberg supports partition evolution.

  3. Using format version 1. Version 2 is required for merge-on-read, row-level deletes, and position delete files. There is no reason to use version 1 for new tables.

  4. Ignoring metadata growth on streaming tables. A streaming job committing every 30 seconds generates 2,880 metadata files per day. Set write.metadata.previous-versions-max to a low number and enable write.metadata.delete-after-commit.enabled.

  5. Setting write.target-file-size-bytes too small. Files under 32 MB create excessive metadata overhead and slow down query planning. The minimum practical size is 64 MB, and 128-512 MB is the sweet spot.

  6. Choosing hour partitioning when day would suffice. Hour partitioning creates 24x more partitions per day. Unless you are ingesting multiple GB per hour, daily partitioning is almost always better.

  7. Not setting a sort order on non-partitioned tables. Without partitioning, all queries scan all files. A sort order makes min/max statistics in manifests useful for file pruning, which can skip 80-90% of files for range queries.


In the next post, we dive into how Iceberg's query engine actually uses these properties at read time — covering partition pruning, file pruning with column statistics, bloom filter row-group filtering, and Spark read/write performance configs. Check out the rest of our Apache Iceberg series for more deep dives.