Iceberg Query Performance Tuning: Partition Pruning, Bloom Filters, and Spark Configs
Your Iceberg tables are created with the right properties. Your partitions are well-designed. But your queries are still slower than you expected. The dashboard that should load in 3 seconds takes 45. The data scientist's notebook times out. The problem is not your table design — it is that you have not tuned the layers between the query and the data.
Apache Iceberg has a sophisticated query planning pipeline that can skip entire partitions, skip individual files within a partition, and even skip row groups within a file. But each of these layers only works if you configure it correctly. This post walks through every pruning layer, explains exactly how Iceberg uses metadata to skip work, and gives you the Spark configurations to control it all.
The Three Layers of Iceberg Query Pruning
When Spark executes a query against an Iceberg table, the query planner evaluates filters at three progressively finer levels:
Query: SELECT * FROM events WHERE event_date = '2026-02-10' AND user_id = 42
Layer 1: Partition Pruning
└── Skip all partitions except event_date = '2026-02-10'
(Eliminates 99.7% of partitions for a year of daily data)
Layer 2: File Pruning (Data Skipping)
└── Within the matching partition, check min/max statistics
in manifest files to skip data files that cannot contain user_id = 42
(Eliminates 60-95% of remaining files if data is sorted)
Layer 3: Row Group Filtering (Bloom Filters)
└── Within each remaining file, check Parquet bloom filters
to skip row groups that definitely do not contain user_id = 42
(Eliminates 80-99% of remaining row groups for point lookups)
Each layer is independent. You can use any combination. But the biggest performance gains come from enabling all three.
Layer 1: Partition Pruning
Partition pruning is the coarsest and most impactful optimization. It eliminates entire directories of data files based on the partition spec.
How It Works
When you create a table with PARTITIONED BY (day(event_time)), Iceberg records the partition transform in the table metadata. At query time, when Spark encounters a filter like WHERE event_time >= '2026-02-01' AND event_time < '2026-02-08', the query planner:
- Reads the manifest list (a single small file).
- For each manifest, checks the partition summary (min/max partition values stored in the manifest list).
- Skips manifests whose partition ranges do not overlap the filter.
- For remaining manifests, reads manifest entries and skips files in non-matching partitions.
The key insight: Iceberg understands the relationship between the source column and the partition transform. A filter on event_time automatically prunes partitions defined by day(event_time) — the user never needs to reference the partition column directly. This is the "hidden partitioning" that makes Iceberg fundamentally different from Hive.
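The pruning logic above can be sketched in a few lines of Python. This is illustrative, not Iceberg's actual API; the manifest paths and partition bounds are invented. It shows how applying the recorded transform to a source-column filter lets the planner skip manifests by their partition summaries:

```python
from datetime import date, datetime

def day_transform(ts: datetime) -> date:
    # The day() transform maps a timestamp to its calendar date.
    return ts.date()

# Partition ranges as stored in manifest-list summaries (invented data).
manifests = [
    {"path": "m1.avro", "min_day": date(2026, 1, 1), "max_day": date(2026, 1, 31)},
    {"path": "m2.avro", "min_day": date(2026, 2, 1), "max_day": date(2026, 2, 28)},
]

# The filter is on the source column; apply the same transform to its bounds.
lo = day_transform(datetime(2026, 2, 1))
hi = day_transform(datetime(2026, 2, 8))

# Keep only manifests whose partition range overlaps the filter range.
to_scan = [m["path"] for m in manifests
           if not (m["max_day"] < lo or m["min_day"] > hi)]
print(to_scan)  # ['m2.avro']
```

January's manifest never overlaps the February filter, so it is skipped without reading a single manifest entry.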
What Triggers Partition Pruning
| Filter Type | Prunes? | Example |
|---|---|---|
| Equality on source column | Yes | WHERE event_time = '2026-02-10 14:30:00' |
| Range on source column | Yes | WHERE event_time BETWEEN '2026-02-01' AND '2026-02-07' |
| IN list on source column | Yes | WHERE event_date IN ('2026-02-10', '2026-02-11') |
| Equality on partition transform | Yes | WHERE day(event_time) = '2026-02-10' |
| Function on source column | Depends | WHERE year(event_time) = 2026 — yes. WHERE UPPER(region) = 'US' — no. |
| No filter on partition column | No | WHERE user_id = 42 (if partitioned only by time) |
| OR across partition columns | Partial | WHERE event_date = '2026-02-10' OR user_id = 42 — the OR prevents full pruning. |
Verifying Partition Pruning
You can check how many files Spark scans by looking at the query plan:
EXPLAIN SELECT COUNT(*) FROM analytics.events
WHERE event_time >= '2026-02-10'
AND event_time < '2026-02-11';
Look for filters and residualFilter in the output. If the partition filter appears under dataFilters instead of being eliminated at the plan level, pruning may not be working as expected.
You can also check scan metrics after execution:
-- Check how many files were scanned vs total
SELECT
COUNT(*) AS total_files,
SUM(CASE WHEN partition.event_time_day = DATE '2026-02-10' THEN 1 ELSE 0 END) AS target_files
FROM analytics.events.files;
Layer 2: File Pruning with Column Statistics
Even after partition pruning narrows the scan to a single day, you may still have thousands of files in that partition. File pruning (also called data skipping) uses min/max column statistics stored in manifest files to skip individual data files.
How Column Statistics Work
Every time Iceberg writes a data file, it records statistics for each column:
| Statistic | What It Stores | Used For |
|---|---|---|
| value_counts | Total number of values per column (including nulls) | Cardinality estimation |
| null_value_counts | Number of null values per column | IS NULL / IS NOT NULL filters |
| nan_value_counts | Number of NaN values (floats) | NaN handling |
| lower_bounds | Minimum value per column | Range filter pruning |
| upper_bounds | Maximum value per column | Range filter pruning |
When Spark evaluates WHERE user_id = 42, the query planner checks each file's manifest entry:
- If the file's lower_bound(user_id) is 100 and upper_bound(user_id) is 500, user_id 42 cannot be in this file — skip it.
- If the file's lower_bound(user_id) is 1 and upper_bound(user_id) is 1000, user_id 42 might be in this file — read it.
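The check is simple range logic. A minimal sketch, with invented file names and bounds:

```python
def can_skip(lower, upper, value):
    # A file is skippable when the filter value falls outside [lower, upper].
    return value < lower or value > upper

files = [
    {"path": "a.parquet", "lower": 100, "upper": 500},
    {"path": "b.parquet", "lower": 1, "upper": 1000},
]

# WHERE user_id = 42: only files whose bounds could contain 42 are read.
to_read = [f["path"] for f in files if not can_skip(f["lower"], f["upper"], 42)]
print(to_read)  # ['b.parquet']
```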
Metrics Collection Modes
Iceberg supports four modes for collecting column statistics, controlled by the write.metadata.metrics.default table property:
| Mode | What It Collects | Storage Overhead | Best For |
|---|---|---|---|
| none | Nothing | Zero | Columns never used in filters (large BLOBs, JSON strings). |
| counts | Value counts and null counts only | Minimal | Columns used for IS NULL checks but not range filters. |
| truncate(N) | Counts + truncated min/max (first N bytes) | Low | String columns. Default is truncate(16). |
| full | Counts + full min/max values | Moderate | Numeric columns, timestamps, and short strings used in filters. |
The default is truncate(16) — which means Iceberg stores the first 16 bytes of the min and max value for string columns. For numeric columns (INT, LONG, DOUBLE, DECIMAL, TIMESTAMP), truncate(16) effectively stores the full value, since these types serialize to 16 bytes or fewer.
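A rough illustration of prefix-based bounds, with invented customer IDs. Note that real Iceberg also rounds a truncated upper bound up (incrementing its last byte) so it remains a valid upper bound; this sketch shows only the prefix comparison:

```python
# truncate(16) keeps only a 16-byte prefix of each string bound.
lower = "customer-00001234-abcdef"[:16]   # 'customer-0000123'
upper = "customer-00009999-ffffff"[:16]   # 'customer-0000999'

value = "customer-00005000"

# Comparing the value's prefix against the truncated bounds is still enough
# to decide whether the file could contain it.
print(lower <= value[:16] <= upper)  # True — the file must be read
```

A 16-byte prefix is usually enough to separate files when string values share only a short common prefix; increase to truncate(32) when long shared prefixes (URLs, paths) make 16 bytes non-selective.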
Configuring Metrics Per Column
This is one of the most underused Iceberg features. You can set different metrics modes for different columns:
ALTER TABLE analytics.events
SET TBLPROPERTIES (
-- Default: truncate at 16 bytes (good for most columns)
'write.metadata.metrics.default' = 'truncate(16)',
-- Full metrics for columns used in range filters
'write.metadata.metrics.column.user_id' = 'full',
'write.metadata.metrics.column.event_time' = 'full',
'write.metadata.metrics.column.amount' = 'full',
-- No metrics for large payload columns (saves manifest space)
'write.metadata.metrics.column.payload' = 'none',
'write.metadata.metrics.column.raw_json' = 'none',
'write.metadata.metrics.column.properties' = 'none'
);
Why this matters: Every column's min/max statistics are stored in the manifest file. If your table has 200 columns, manifest files can become bloated with statistics for columns that are never filtered on. Setting none for large, unfiltered columns (JSON blobs, raw text, maps) reduces manifest size and speeds up query planning.
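A back-of-envelope estimate makes the bloat concrete. The per-bound and per-count sizes below are assumptions for illustration, not exact Iceberg serialization sizes:

```python
# Rough manifest statistics overhead for a wide table (assumed sizes).
columns = 200
bytes_per_bound = 16            # assumed average serialized bound size
bytes_per_count = 8             # value/null/nan counts per column

# Per data file: two bounds plus three counts for every tracked column.
per_file = columns * (2 * bytes_per_bound + 3 * bytes_per_count)

data_files = 100_000
total_gb = per_file * data_files / 1024**3
print(per_file, round(total_gb, 2))  # 11200 1.04
```

Roughly a gigabyte of statistics that query planning must sift through — most of it for columns no query ever filters on.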
Why Sort Order Makes Statistics Effective
Here is the critical insight: min/max statistics are only useful when data is sorted or at least clustered.
Consider a file with 1 million rows of user_id values:
Unsorted file:
- lower_bound(user_id) = 1
- upper_bound(user_id) = 10,000,000
- A filter WHERE user_id = 42 matches this range — the file cannot be skipped.

Sorted file (user_ids 1 through 1,000):
- lower_bound(user_id) = 1
- upper_bound(user_id) = 1,000
- A filter WHERE user_id = 5,000 does not match this range — the file is skipped.
When data is randomly distributed across files, every file's min/max range spans the entire domain, making statistics useless. When data is sorted, each file covers a narrow, non-overlapping range, and statistics become highly selective.
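A quick simulation with synthetic data makes the difference concrete:

```python
import random

random.seed(0)
n_files, rows_per_file = 100, 1000
values = list(range(n_files * rows_per_file))

# Sorted layout: each file holds one contiguous, non-overlapping chunk.
sorted_files = [values[i * rows_per_file:(i + 1) * rows_per_file]
                for i in range(n_files)]

# Random layout: shuffle first, then chunk into files.
shuffled = values[:]
random.shuffle(shuffled)
random_files = [shuffled[i * rows_per_file:(i + 1) * rows_per_file]
                for i in range(n_files)]

def files_read(files, target):
    # A file must be read when target falls inside its [min, max] bounds.
    return sum(1 for f in files if min(f) <= target <= max(f))

print(files_read(sorted_files, 50_000))   # 1
print(files_read(random_files, 50_000))   # 100 with this seed: every file's range spans the value
```

Same data, same statistics mechanism — but sorting turns a full scan into a single-file read.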
-- Add a sort order to make file pruning effective
ALTER TABLE analytics.events
WRITE ORDERED BY (user_id);
-- Rewrite existing data to apply the sort
CALL system.rewrite_data_files(
table => 'analytics.events',
strategy => 'sort'
);
After sorting, a query like WHERE user_id = 42 might scan 2-3 files instead of 500 — a 100x improvement from sort order alone.
Z-Order: Multi-Column Sort for Multi-Column Filters
Standard sorting works well when queries filter on a single column. But what if queries filter on different columns interchangeably — sometimes user_id, sometimes country, sometimes event_type?
Z-ordering (also called multi-dimensional clustering) interleaves the sort across multiple columns, so that each file covers a small range in all dimensions simultaneously:
CALL system.rewrite_data_files(
table => 'analytics.events',
strategy => 'sort',
sort_order => 'zorder(user_id, country, event_type)'
);
| Strategy | Best For | Trade-off |
|---|---|---|
| Linear sort (ORDERED BY col) | Queries always filter on the same column | Perfect pruning for that column, poor for others. |
| Z-order (zorder(col1, col2, col3)) | Queries filter on varying columns | Good pruning for all columns, not perfect for any single one. |
Recommendation: Use linear sort when 80%+ of queries filter on the same column. Use z-order when query patterns are diverse. Do not z-order more than 3-4 columns — effectiveness drops significantly beyond that.
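The interleaving idea itself is simple. A minimal sketch for two integer columns (illustrative; real engines first map each column to a fixed-width unsigned encoding before interleaving):

```python
def zorder_key(x: int, y: int, bits: int = 16) -> int:
    # Interleave the bits of x and y: x occupies even bit positions, y odd ones.
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

points = [(0, 0), (1, 0), (0, 1), (1, 1), (3, 3)]
print([zorder_key(x, y) for x, y in points])  # [0, 1, 2, 3, 15]
```

Sorting rows by this key places points that are close in both x and y into the same files, so each file covers a small range in every interleaved dimension.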
Layer 3: Bloom Filters for Row Group Filtering
The finest level of pruning happens inside individual Parquet files, at the row group level. Parquet bloom filters are probabilistic data structures that can definitively say "this value is NOT in this row group" — allowing the reader to skip row groups entirely.
How Bloom Filters Work
A bloom filter is a bit array with hash functions. When writing a row group, Iceberg hashes every value in the column and sets bits in the filter. At read time, the query engine hashes the filter value and checks the bits:
- All bits set → The value might be in this row group (read it).
- Any bit not set → The value is definitely not in this row group (skip it).
Bloom filters have a false positive rate (they sometimes say "might be here" when the value is actually absent) but never have false negatives (they never skip a row group that actually contains the value).
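The mechanics can be shown with a toy filter. This is not Parquet's actual split-block bloom filter implementation, but the no-false-negatives property it demonstrates is the same:

```python
import hashlib

class BloomSketch:
    # Toy bloom filter: a bit array plus k deterministic hash positions per value.
    def __init__(self, size_bits=8192, num_hashes=3):
        self.size, self.num_hashes = size_bits, num_hashes
        self.bits = 0

    def _positions(self, value):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # All bits set -> value might be present; any bit clear -> definitely absent.
        return all(self.bits >> pos & 1 for pos in self._positions(value))

bf = BloomSketch()
for user_id in range(100, 200):
    bf.add(user_id)

print(bf.might_contain(150))  # True: added values are never missed
print(bf.might_contain(42))   # almost always False (tiny false-positive chance): skippable
```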
Enabling Bloom Filters
Bloom filters are disabled by default. You enable them per column:
ALTER TABLE analytics.events
SET TBLPROPERTIES (
-- Enable bloom filters for high-cardinality lookup columns
'write.parquet.bloom-filter-enabled.column.user_id' = 'true',
'write.parquet.bloom-filter-enabled.column.session_id' = 'true',
'write.parquet.bloom-filter-enabled.column.event_id' = 'true',
-- Control bloom filter size (default is 1MB per column per row group)
'write.parquet.bloom-filter-max-bytes' = '1048576'
);
Important: Bloom filters only help with equality predicates (WHERE user_id = 42). They do not help with range queries (WHERE user_id > 100 AND user_id < 200). For range queries, min/max statistics and sort order are the right tools.
Which Columns Should Have Bloom Filters?
| Column Type | Bloom Filter? | Why |
|---|---|---|
| High-cardinality ID columns (user_id, session_id, order_id) | Yes | Point lookups on high-cardinality columns benefit most. |
| Low-cardinality columns (status, country, event_type) | No | Only a few distinct values — min/max statistics are sufficient. |
| Columns used only in range filters | No | Bloom filters only work for equality checks. |
| Columns already used as partition keys | No | Partition pruning already handles these. |
| Very wide string columns (URLs, JSON) | No | Large values increase bloom filter false positive rates and size. |
Storage Overhead
Each bloom filter adds approximately 1 MB per column per row group (configurable via write.parquet.bloom-filter-max-bytes). For a 256 MB file with 2 row groups and bloom filters on 3 columns:
Overhead = 2 row groups × 3 columns × 1 MB = 6 MB (2.3% of file size)
This is a negligible overhead for the 80-99% row group skip rate that bloom filters provide on point lookups.
Bloom Filters + Sort Order: The Power Combination
The best performance comes from combining sort order with bloom filters:
- Sort order → Files have narrow min/max ranges → File pruning skips 80-95% of files.
- Bloom filters → Within remaining files, row groups are skipped → Row group filtering skips 80-99% of remaining row groups.
The compound effect: a point lookup on a sorted, bloom-filtered table might scan 0.01% of the total data.
-- Optimal configuration for point-lookup heavy tables
ALTER TABLE analytics.events
SET TBLPROPERTIES (
'write.parquet.bloom-filter-enabled.column.user_id' = 'true',
'write.parquet.bloom-filter-enabled.column.session_id' = 'true'
);
ALTER TABLE analytics.events
WRITE ORDERED BY (user_id);
-- Rewrite data to apply sort and generate bloom filters
CALL system.rewrite_data_files(
table => 'analytics.events',
strategy => 'sort'
);
Spark Read Configurations
Beyond table properties, Spark has its own configurations that affect how it reads Iceberg tables.
Split Planning
Spark divides the files to be read into splits — the unit of parallelism. Each split becomes one Spark task.
| Config | Default | What It Does |
|---|---|---|
| read.split.target-size | 134217728 (128 MB) | Target size per split. Controls how many Spark tasks are created. |
| read.split.metadata-target-size | 33554432 (32 MB) | Target size for splits when reading metadata tables. |
| read.split.open-file-cost | 4194304 (4 MB) | Estimated overhead for opening a new file, used as a minimum per-file weight when bin-packing files into splits. |
| read.split.planning-lookback | 10 | Number of bins to look back when combining small files into a single split. |
Key recommendations:
- Reduce read.split.target-size to 64 MB for latency-sensitive queries. Smaller splits create more Spark tasks, which means more parallelism. This trades cluster resource usage for faster individual query times.
- Increase read.split.target-size to 256-512 MB for large batch scans. Fewer, larger splits reduce task scheduling overhead and are more efficient for full-table analytics.
-- Set at table level (affects all readers of this table)
ALTER TABLE analytics.events
SET TBLPROPERTIES (
'read.split.target-size' = '67108864'
);
Or override it for a single read with Iceberg's split-size read option:
spark.read.option("split-size", "67108864").table("analytics.events")
Vectorized Reads
Spark can read Parquet files using vectorized (columnar batch) processing, which is significantly faster than row-based reads:
# Ensure vectorized reads are enabled (default in modern Spark)
spark.conf.set("spark.sql.iceberg.vectorization.enabled", "true")
Vectorized reads process data in columnar batches of 4,096 rows, enabling SIMD-like processing and avoiding per-row object creation overhead. This is particularly effective for analytical queries that aggregate or filter on a few columns from wide tables.
Data Locality and Caching
On cloud object stores (S3, GCS, ADLS), every file access involves a network round-trip. Spark configurations that reduce unnecessary round-trips make a significant difference:
# Enable S3 request range reads (avoid downloading entire files for projections)
spark.conf.set("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
# Increase S3 connection pool for parallel reads
spark.conf.set("spark.hadoop.fs.s3a.connection.maximum", "200")
# Increase readahead for sequential column scans
spark.conf.set("spark.hadoop.fs.s3a.readahead.range", "1M")
Spark Write Configurations
Write-side configurations interact with Iceberg table properties to determine the final file layout.
Shuffle and Distribution
| Config | Default | Impact |
|---|---|---|
| spark.sql.shuffle.partitions | 200 | Number of Spark partitions after a shuffle. Directly affects the number of write tasks. |
| spark.sql.adaptive.enabled | true (Spark 3.2+) | Enables Adaptive Query Execution (AQE), which dynamically adjusts partition counts. |
| spark.sql.adaptive.coalescePartitions.enabled | true | AQE merges small shuffle partitions into larger ones. |
| spark.sql.adaptive.advisoryPartitionSizeInBytes | 67108864 (64 MB) | AQE target partition size when coalescing. |
How these interact with Iceberg's write.distribution-mode:
When write.distribution-mode = hash, Iceberg adds a shuffle before the write. The number of post-shuffle partitions is controlled by spark.sql.shuffle.partitions (when AQE is off) or dynamically by AQE (when enabled).
Recommendation: Enable AQE and set the advisory partition size to match your target file size:
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "268435456") # 256 MB
This tells AQE to coalesce small post-shuffle partitions until each is approximately 256 MB — aligning with a typical write.target-file-size-bytes of 256 MB. The result: each Spark task writes approximately one well-sized file.
Fanout Writers
For tables with many partitions and where you want to avoid the cost of a full shuffle, Spark supports fanout writers:
ALTER TABLE analytics.events
SET TBLPROPERTIES (
'write.spark.fanout.enabled' = 'true'
);
With fanout enabled and write.distribution-mode = none, each Spark task keeps one file writer open per partition it encounters. This avoids shuffle but requires enough memory to hold all open writers simultaneously.
| Approach | Shuffle Cost | Memory Usage | File Count |
|---|---|---|---|
| distribution-mode=hash | Yes (full shuffle) | Low (one writer per task) | Optimal — one file per task per partition |
| distribution-mode=none + fanout=true | None | High (one writer per partition per task) | Optimal — one file per task per partition |
| distribution-mode=none + fanout=false | None | Low | Poor — many small files |
Use fanout when: you have a moderate number of partitions (under 100) and want to avoid shuffle overhead. Avoid fanout when: you have thousands of partitions — the memory for thousands of open file writers per task can cause OOM errors.
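A rough estimate with assumed buffer sizes shows why fanout with many partitions fails. The per-writer buffer and task counts below are illustrative assumptions:

```python
# Why fanout writers can OOM: each task holds one open writer per partition
# it touches, and each writer buffers a row group in memory before flushing.
partitions_touched_per_task = 2000
writer_buffer_mb = 64              # assumed in-memory row-group buffer per writer
concurrent_tasks_per_executor = 4

gb_per_executor = (partitions_touched_per_task * writer_buffer_mb
                   * concurrent_tasks_per_executor) / 1024
print(gb_per_executor)  # 500.0 — far beyond any realistic executor heap
```

With 100 partitions instead of 2,000, the same arithmetic gives 25 GB, which is still demanding but within reach of a large executor; this is why the partition-count threshold matters.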
Streaming Commit Frequency
# For Structured Streaming writes, commit frequency is controlled by the
# trigger interval: each completed micro-batch commits one Iceberg snapshot.
df.writeStream \
    .format("iceberg") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://bucket/checkpoints/events") \
    .trigger(processingTime="60 seconds") \
    .toTable("analytics.events")
For streaming workloads with Structured Streaming, the trigger interval controls how frequently Spark commits a new Iceberg snapshot. Shorter intervals mean fresher data but more snapshots to manage. A 60-second trigger is a good default.
Data Layout Optimization
Even with the right table properties and Spark configurations, data that has accumulated over time may have suboptimal layout. Iceberg's rewrite_data_files procedure lets you reorganize data without affecting concurrent readers.
Compaction Strategies
| Strategy | What It Does | When to Use |
|---|---|---|
| binpack | Combines small files into larger ones without changing order. | After many small streaming commits. Fast. |
| sort | Rewrites all files sorted by the table's defined sort order. | When you need effective min/max statistics. Slower but higher quality. |
| sort with zorder | Rewrites files with multi-dimensional interleaving. | Multi-column filter workloads. |
-- Binpack: quick compaction of small files
CALL system.rewrite_data_files(
table => 'analytics.events',
strategy => 'binpack',
options => map(
'target-file-size-bytes', '268435456',
'min-file-size-bytes', '104857600',
'max-file-size-bytes', '536870912'
)
);
-- Sort: full rewrite with optimal data layout
CALL system.rewrite_data_files(
table => 'analytics.events',
strategy => 'sort',
sort_order => 'user_id ASC NULLS LAST, event_time ASC'
);
-- Z-order: multi-column clustering
CALL system.rewrite_data_files(
table => 'analytics.events',
strategy => 'sort',
sort_order => 'zorder(user_id, country)'
);
Partial Compaction with Filters
For large tables, you can compact only specific partitions:
-- Compact only the last 7 days
CALL system.rewrite_data_files(
table => 'analytics.events',
where => 'event_time >= TIMESTAMP ''2026-02-05 00:00:00'' AND event_time < TIMESTAMP ''2026-02-12 00:00:00'''
);
Manifest Compaction
Over time, the manifest list can accumulate many small manifests. Rewriting manifests keeps query planning fast:
CALL system.rewrite_manifests(
table => 'analytics.events'
);
This is particularly important after running rewrite_data_files, which can leave behind outdated manifest entries.
Putting It All Together: Performance Tuning Playbook
Here is a step-by-step checklist for tuning an existing Iceberg table:
Step 1: Identify the Query Patterns
What do your queries filter on? Group them:
| Pattern | Optimization |
|---|---|
| Always filter on time | Time-based partitioning (day, month) |
| Always filter on one ID column | Sort by that column + bloom filter |
| Filter on varying columns | Z-order across top 2-3 filter columns |
| Full table scans | Increase read parallelism, smaller split sizes |
| Point lookups on specific IDs | Bloom filters + sort order |
Step 2: Configure Metrics Collection
ALTER TABLE analytics.events
SET TBLPROPERTIES (
'write.metadata.metrics.default' = 'truncate(16)',
'write.metadata.metrics.column.user_id' = 'full',
'write.metadata.metrics.column.event_time' = 'full',
'write.metadata.metrics.column.payload' = 'none',
'write.metadata.metrics.column.raw_json' = 'none'
);
Step 3: Add Sort Order
ALTER TABLE analytics.events
WRITE ORDERED BY (user_id);
Step 4: Enable Bloom Filters on Lookup Columns
ALTER TABLE analytics.events
SET TBLPROPERTIES (
'write.parquet.bloom-filter-enabled.column.user_id' = 'true',
'write.parquet.bloom-filter-enabled.column.session_id' = 'true'
);
Step 5: Rewrite Data Files
CALL system.rewrite_data_files(
table => 'analytics.events',
strategy => 'sort'
);
Step 6: Rewrite Manifests
CALL system.rewrite_manifests(
table => 'analytics.events'
);
Step 7: Configure Spark Session
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "268435456")
spark.conf.set("spark.sql.iceberg.vectorization.enabled", "true")
spark.conf.set("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
spark.conf.set("spark.hadoop.fs.s3a.connection.maximum", "200")
Step 8: Measure and Iterate
After applying changes, measure the improvement:
-- Compare file counts before and after compaction
SELECT
COUNT(*) AS file_count,
ROUND(AVG(file_size_in_bytes) / 1048576, 1) AS avg_file_mb,
ROUND(SUM(file_size_in_bytes) / 1073741824.0, 2) AS total_gb
FROM analytics.events.files;
-- Check column statistics are being collected
SELECT
content,
file_path,
lower_bounds,
upper_bounds
FROM analytics.events.entries
LIMIT 5;
Quick Reference: Property Cheat Sheet
Table Properties for Read Performance
| Goal | Property | Recommended Value |
|---|---|---|
| Efficient file pruning on numbers | write.metadata.metrics.column.<col> | full |
| Efficient file pruning on strings | write.metadata.metrics.column.<col> | truncate(16) or truncate(32) |
| Disable stats for blob columns | write.metadata.metrics.column.<col> | none |
| Point lookup filtering | write.parquet.bloom-filter-enabled.column.<col> | true |
| Bloom filter size | write.parquet.bloom-filter-max-bytes | 1048576 (1 MB default) |
| Split size for parallelism | read.split.target-size | 67108864 (low-latency) to 268435456 (batch) |
Table Properties for Write Performance
| Goal | Property | Recommended Value |
|---|---|---|
| Prevent small files | write.distribution-mode | hash (partitioned) |
| Target file size | write.target-file-size-bytes | 268435456 (256 MB) |
| Modern compression | write.parquet.compression-codec | zstd |
| S3 write distribution | write.object-storage.enabled | true |
| Skip shuffle (few partitions) | write.spark.fanout.enabled | true |
Spark Session Properties
| Goal | Config | Recommended Value |
|---|---|---|
| Enable AQE | spark.sql.adaptive.enabled | true |
| AQE partition coalescing | spark.sql.adaptive.coalescePartitions.enabled | true |
| AQE target partition size | spark.sql.adaptive.advisoryPartitionSizeInBytes | 268435456 |
| Vectorized reads | spark.sql.iceberg.vectorization.enabled | true |
| S3 random reads | spark.hadoop.fs.s3a.experimental.input.fadvise | random |
| S3 connection pool | spark.hadoop.fs.s3a.connection.maximum | 200 |
Common Performance Anti-Patterns
- No sort order on tables used for point lookups. Without sorting, min/max statistics span the entire value range, making file pruning useless. Always add a sort order for columns frequently used in WHERE clauses.
- Bloom filters on low-cardinality columns. A bloom filter on a status column with 5 distinct values wastes space. Min/max statistics already handle this efficiently.
- Collecting full metrics on all columns. A table with 500 columns does not need statistics on every one. Set none for large, unfiltered columns to keep manifests compact.
- Never running compaction. File pruning only works well when files have narrow, non-overlapping value ranges. Without periodic compaction (sort or z-order), files accumulate with increasingly overlapping ranges.
- Setting read.split.target-size too small. Splits under 16 MB create excessive task scheduling overhead. The minimum practical value is 32 MB, and 64-256 MB is the sweet spot.
- Z-ordering more than 4 columns. Z-order effectiveness degrades sharply with each additional column. Three columns is the practical maximum for most workloads.
- Ignoring manifest compaction. Thousands of small manifests slow down query planning. Run rewrite_manifests periodically, especially after large rewrite_data_files operations.
This post is part of our Apache Iceberg deep-dive series. For table creation guidance, see Iceberg Table Design: Properties, Partitioning, and Commit Best Practices. For file size management, see Mastering Iceberg File Sizes. Check out the full series on our blog.