
Iceberg Query Performance Tuning: Partition Pruning, Bloom Filters, and Spark Configs

19 min read
Cazpian Engineering
Platform Engineering Team


Your Iceberg tables are created with the right properties. Your partitions are well-designed. But your queries are still slower than you expected. The dashboard that should load in 3 seconds takes 45. The data scientist's notebook times out. The problem is not your table design — it is that you have not tuned the layers between the query and the data.

Apache Iceberg has a sophisticated query planning pipeline that can skip entire partitions, skip individual files within a partition, and even skip row groups within a file. But each of these layers only works if you configure it correctly. This post walks through every pruning layer, explains exactly how Iceberg uses metadata to skip work, and gives you the Spark configurations to control it all.

The Three Layers of Iceberg Query Pruning

When Spark executes a query against an Iceberg table, the query planner evaluates filters at three progressively finer levels:

[Diagram: Iceberg's three-layer query pruning pipeline — partition pruning, file-level data skipping with column statistics, and row-group filtering with bloom filters]

Query: SELECT * FROM events WHERE event_date = '2026-02-10' AND user_id = 42

Layer 1: Partition Pruning
└── Skip all partitions except event_date = '2026-02-10'
    (Eliminates 99.7% of partitions for a year of daily data)

Layer 2: File Pruning (Data Skipping)
└── Within the matching partition, check min/max statistics in manifest
    files to skip data files that cannot contain user_id = 42
    (Eliminates 60-95% of remaining files if data is sorted)

Layer 3: Row Group Filtering (Bloom Filters)
└── Within each remaining file, check Parquet bloom filters to skip
    row groups that definitely do not contain user_id = 42
    (Eliminates 80-99% of remaining row groups for point lookups)

Each layer is independent. You can use any combination. But the biggest performance gains come from enabling all three.

Layer 1: Partition Pruning

Partition pruning is the coarsest and most impactful optimization. It eliminates entire directories of data files based on the partition spec.

How It Works

When you create a table with PARTITIONED BY (day(event_time)), Iceberg records the partition transform in the table metadata. At query time, when Spark encounters a filter like WHERE event_time >= '2026-02-01' AND event_time < '2026-02-08', the query planner:

  1. Reads the manifest list (a single small file).
  2. For each manifest, checks the partition summary (min/max partition values stored in the manifest list).
  3. Skips manifests whose partition ranges do not overlap the filter.
  4. For remaining manifests, reads manifest entries and skips files in non-matching partitions.

The key insight: Iceberg understands the relationship between the source column and the partition transform. A filter on event_time automatically prunes partitions defined by day(event_time) — the user never needs to reference the partition column directly. This is the "hidden partitioning" that makes Iceberg fundamentally different from Hive.
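To make the transform relationship concrete, here is a minimal Python sketch (illustrative, not Iceberg's implementation) of how a filter on the source column maps to a range of day(event_time) partition values:

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def day_transform(ts: datetime) -> int:
    """Iceberg's day() transform: days since the Unix epoch."""
    return (ts - EPOCH).days

# A filter on the *source* column event_time...
lo = datetime(2026, 2, 1, tzinfo=timezone.utc)
hi = datetime(2026, 2, 8, tzinfo=timezone.utc)

# ...is converted by the planner into a range over *partition* values,
# so the user never references the partition column directly.
def partition_matches(partition_day: int) -> bool:
    return day_transform(lo) <= partition_day < day_transform(hi)
```

This is the core of hidden partitioning: the rewrite from source-column predicate to partition-value range happens inside the planner, not in the user's SQL.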

What Triggers Partition Pruning

| Filter Type | Prunes? | Example |
|---|---|---|
| Equality on source column | Yes | WHERE event_time = '2026-02-10 14:30:00' |
| Range on source column | Yes | WHERE event_time BETWEEN '2026-02-01' AND '2026-02-07' |
| IN list on source column | Yes | WHERE event_date IN ('2026-02-10', '2026-02-11') |
| Equality on partition transform | Yes | WHERE day(event_time) = '2026-02-10' |
| Function on source column | Depends | WHERE year(event_time) = 2026 — yes. WHERE UPPER(region) = 'US' — no. |
| No filter on partition column | No | WHERE user_id = 42 (if partitioned only by time) |
| OR across partition columns | Partial | WHERE event_date = '2026-02-10' OR user_id = 42 — the OR prevents full pruning. |

Verifying Partition Pruning

You can check how many files Spark scans by looking at the query plan:

EXPLAIN SELECT COUNT(*) FROM analytics.events
WHERE event_time >= '2026-02-10'
AND event_time < '2026-02-11';

Look for filters and residualFilter in the output. If the partition filter appears under dataFilters instead of being eliminated at the plan level, pruning may not be working as expected.

You can also check scan metrics after execution:

-- Check how many files were scanned vs total
SELECT
COUNT(*) AS total_files,
SUM(CASE WHEN partition.event_time_day = DATE '2026-02-10' THEN 1 ELSE 0 END) AS target_files
FROM analytics.events.files;

Layer 2: File Pruning with Column Statistics

Even after partition pruning narrows the scan to a single day, you may still have thousands of files in that partition. File pruning (also called data skipping) uses min/max column statistics stored in manifest files to skip individual data files.

How Column Statistics Work

Every time Iceberg writes a data file, it records statistics for each column:

| Statistic | What It Stores | Used For |
|---|---|---|
| value_counts | Number of non-null values per column | Cardinality estimation |
| null_value_counts | Number of null values per column | IS NULL / IS NOT NULL filters |
| nan_value_counts | Number of NaN values (floats) | NaN handling |
| lower_bounds | Minimum value per column | Range filter pruning |
| upper_bounds | Maximum value per column | Range filter pruning |

When Spark evaluates WHERE user_id = 42, the query planner checks each file's manifest entry:

  • If the file's lower_bound(user_id) is 100 and upper_bound(user_id) is 500, user_id 42 cannot be in this file — skip it.
  • If the file's lower_bound(user_id) is 1 and upper_bound(user_id) is 1000, user_id 42 might be in this file — read it.
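That check is simple enough to sketch in a few lines of Python (illustrative, not Iceberg's code):

```python
def can_skip_file(lower_bound, upper_bound, value) -> bool:
    """Equality-predicate data skipping: a file whose [min, max]
    range excludes the value cannot contain a matching row."""
    return value < lower_bound or value > upper_bound

# Simplified manifest entries: (path, lower_bound(user_id), upper_bound(user_id))
manifest_entries = [
    ("f1.parquet", 100, 500),   # 42 < 100  -> skipped
    ("f2.parquet", 1, 1000),    # 1 <= 42 <= 1000 -> must be read
]

files_to_read = [path for path, lo, hi in manifest_entries
                 if not can_skip_file(lo, hi, 42)]
```

Note that the statistics can only prove absence: a file whose range contains the value must still be read, even if the value turns out not to be there.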

Metrics Collection Modes

Iceberg supports four modes for collecting column statistics, controlled by the write.metadata.metrics.default table property:

| Mode | What It Collects | Storage Overhead | Best For |
|---|---|---|---|
| none | Nothing | Zero | Columns never used in filters (large BLOBs, JSON strings). |
| counts | Value counts and null counts only | Minimal | Columns used for IS NULL checks but not range filters. |
| truncate(N) | Counts + truncated min/max (first N bytes) | Low | String columns. Default is truncate(16). |
| full | Counts + full min/max values | Moderate | Numeric columns, timestamps, and short strings used in filters. |

The default is truncate(16) — which means Iceberg stores the first 16 bytes of the min and max value for string columns. For numeric columns (INT, LONG, DOUBLE, DECIMAL, TIMESTAMP), truncate(16) effectively stores the full value, since these fixed-width types all fit within the 16-byte limit.
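For string columns the truncation has a subtlety worth seeing: a prefix of the minimum is always a valid lower bound, but a truncated maximum must be bumped upward to remain a valid upper bound. A simplified byte-level sketch of the idea (hypothetical helper names, not Iceberg's exact code):

```python
def truncate_lower(value: bytes, n: int = 16) -> bytes:
    # A prefix always sorts <= the full value, so it stays a safe lower bound.
    return value[:n]

def truncate_upper(value: bytes, n: int = 16) -> bytes:
    # A plain prefix could sort *below* real values, so increment the last
    # byte that is not 0xFF to keep the result a safe upper bound.
    if len(value) <= n:
        return value
    prefix = bytearray(value[:n])
    for i in reversed(range(len(prefix))):
        if prefix[i] < 0xFF:
            prefix[i] += 1
            return bytes(prefix[:i + 1])
    return value  # every byte is 0xFF: no safe truncated bound exists

url = b"https://example.com/some/very/long/path"
lo, hi = truncate_lower(url), truncate_upper(url)
```

Because the adjusted bounds still bracket every real value, pruning against truncated statistics remains safe — it can only be slightly less selective than full statistics.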

Configuring Metrics Per Column

This is one of the most underused Iceberg features. You can set different metrics modes for different columns:

ALTER TABLE analytics.events
SET TBLPROPERTIES (
-- Default: truncate at 16 bytes (good for most columns)
'write.metadata.metrics.default' = 'truncate(16)',

-- Full metrics for columns used in range filters
'write.metadata.metrics.column.user_id' = 'full',
'write.metadata.metrics.column.event_time' = 'full',
'write.metadata.metrics.column.amount' = 'full',

-- No metrics for large payload columns (saves manifest space)
'write.metadata.metrics.column.payload' = 'none',
'write.metadata.metrics.column.raw_json' = 'none',
'write.metadata.metrics.column.properties' = 'none'
);

Why this matters: Every column's min/max statistics are stored in the manifest file. If your table has 200 columns, manifest files can become bloated with statistics for columns that are never filtered on. Setting none for large, unfiltered columns (JSON blobs, raw text, maps) reduces manifest size and speeds up query planning.

Why Sort Order Makes Statistics Effective

Here is the critical insight: min/max statistics are only useful when data is sorted or at least clustered.

Consider a file with 1 million rows of user_id values:

Unsorted file:

  • lower_bound(user_id) = 1
  • upper_bound(user_id) = 10,000,000
  • A filter WHERE user_id = 42 matches this range — the file cannot be skipped

Sorted file (user_ids 1 through 1,000):

  • lower_bound(user_id) = 1
  • upper_bound(user_id) = 1,000
  • A filter WHERE user_id = 5,000 does not match this range — the file is skipped

When data is randomly distributed across files, every file's min/max range spans the entire domain, making statistics useless. When data is sorted, each file covers a narrow, non-overlapping range, and statistics become highly selective.

-- Add a sort order to make file pruning effective
ALTER TABLE analytics.events
WRITE ORDERED BY (user_id);

-- Rewrite existing data to apply the sort
CALL system.rewrite_data_files(
table => 'analytics.events',
strategy => 'sort'
);

After sorting, a query like WHERE user_id = 42 might scan 2-3 files instead of 500 — an improvement of more than 100x from sort order alone.

Z-Order: Multi-Column Sort for Multi-Column Filters

Standard sorting works well when queries filter on a single column. But what if queries filter on different columns interchangeably — sometimes user_id, sometimes country, sometimes event_type?

Z-ordering (also called multi-dimensional clustering) interleaves the sort across multiple columns, so that each file covers a small range in all dimensions simultaneously:

CALL system.rewrite_data_files(
table => 'analytics.events',
strategy => 'sort',
sort_order => 'zorder(user_id, country, event_type)'
);

| Strategy | Best For | Trade-off |
|---|---|---|
| Linear sort (ORDERED BY col) | Queries always filter on the same column | Perfect pruning for that column, poor for others. |
| Z-order (zorder(col1, col2, col3)) | Queries filter on varying columns | Good pruning for all columns, not perfect for any single one. |

Recommendation: Use linear sort when 80%+ of queries filter on the same column. Use z-order when query patterns are diverse. Do not z-order more than 3-4 columns — effectiveness drops significantly beyond that.

Layer 3: Bloom Filters for Row Group Filtering

The finest level of pruning happens inside individual Parquet files, at the row group level. Parquet bloom filters are probabilistic data structures that can definitively say "this value is NOT in this row group" — allowing the reader to skip row groups entirely.

How Bloom Filters Work

A bloom filter is a bit array with hash functions. When writing a row group, Iceberg hashes every value in the column and sets bits in the filter. At read time, the query engine hashes the filter value and checks the bits:

  • All bits set → The value might be in this row group (read it).
  • Any bit not set → The value is definitely not in this row group (skip it).

Bloom filters have a false positive rate (they sometimes say "might be here" when the value is actually absent) but never have false negatives (they never skip a row group that actually contains the value).
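The mechanics fit in a small Python sketch — a toy filter built on hashlib, not Parquet's split-block implementation:

```python
import hashlib

class ToyBloomFilter:
    """Minimal bloom filter: k hash functions setting bits in a bit array."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value) -> bool:
        # False -> definitely absent (skip the row group).
        # True  -> possibly present (must read it).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))

# One filter per row group: index every user_id written to it.
row_group_filter = ToyBloomFilter()
for user_id in range(100, 200):
    row_group_filter.add(user_id)
```

A might_contain that returns False guarantees the row group can be skipped; a True may be a false positive, which only costs an unnecessary read, never a wrong result.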

Enabling Bloom Filters

Bloom filters are disabled by default. You enable them per column:

ALTER TABLE analytics.events
SET TBLPROPERTIES (
-- Enable bloom filters for high-cardinality lookup columns
'write.parquet.bloom-filter-enabled.column.user_id' = 'true',
'write.parquet.bloom-filter-enabled.column.session_id' = 'true',
'write.parquet.bloom-filter-enabled.column.event_id' = 'true',

-- Control bloom filter size (default is 1MB per column per row group)
'write.parquet.bloom-filter-max-bytes' = '1048576'
);

Important: Bloom filters only help with equality predicates (WHERE user_id = 42). They do not help with range queries (WHERE user_id > 100 AND user_id < 200). For range queries, min/max statistics and sort order are the right tools.

Which Columns Should Have Bloom Filters?

| Column Type | Bloom Filter? | Why |
|---|---|---|
| High-cardinality ID columns (user_id, session_id, order_id) | Yes | Point lookups on high-cardinality columns benefit most. |
| Low-cardinality columns (status, country, event_type) | No | Only a few distinct values — min/max statistics are sufficient. |
| Columns used only in range filters | No | Bloom filters only work for equality checks. |
| Columns already used as partition keys | No | Partition pruning already handles these. |
| Very wide string columns (URLs, JSON) | No | Large values increase bloom filter false positive rates and size. |

Storage Overhead

Each bloom filter adds approximately 1 MB per column per row group (configurable via write.parquet.bloom-filter-max-bytes). For a 256 MB file with 2 row groups and bloom filters on 3 columns:

Overhead = 2 row groups × 3 columns × 1 MB = 6 MB (2.3% of file size)

This is a negligible overhead for the 80-99% row group skip rate that bloom filters provide on point lookups.

Bloom Filters + Sort Order: The Power Combination

The best performance comes from combining sort order with bloom filters:

  1. Sort order → Files have narrow min/max ranges → File pruning skips 80-95% of files.
  2. Bloom filters → Within remaining files, row groups are skipped → Row group filtering skips 80-99% of remaining row groups.

The compound effect: a point lookup on a sorted, bloom-filtered table might scan 0.01% of the total data.
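The arithmetic behind that claim, using the conservative end of each range above (assumed survival rates for illustration, not measurements):

```python
# Fraction of data that survives each pruning layer for a point lookup
# on one day out of a year of daily partitions.
partition_survival = 1 / 365   # partition pruning keeps one daily partition
file_survival = 0.20           # sort order skips ~80% of remaining files
row_group_survival = 0.20      # bloom filters skip ~80% of remaining row groups

scanned_fraction = partition_survival * file_survival * row_group_survival
print(f"{scanned_fraction:.6%}")  # on the order of 0.01% of the table
```

Because the layers multiply, improving any single layer improves the final scan fraction proportionally.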

-- Optimal configuration for point-lookup heavy tables
ALTER TABLE analytics.events
SET TBLPROPERTIES (
'write.parquet.bloom-filter-enabled.column.user_id' = 'true',
'write.parquet.bloom-filter-enabled.column.session_id' = 'true'
);

ALTER TABLE analytics.events
WRITE ORDERED BY (user_id);

-- Rewrite data to apply sort and generate bloom filters
CALL system.rewrite_data_files(
table => 'analytics.events',
strategy => 'sort'
);

Spark Read Configurations

Beyond table properties, Spark has its own configurations that affect how it reads Iceberg tables.

Split Planning

Spark divides the files to be read into splits — the unit of parallelism. Each split becomes one Spark task.

| Config | Default | What It Does |
|---|---|---|
| read.split.target-size | 134217728 (128 MB) | Target size per split. Controls how many Spark tasks are created. |
| read.split.metadata-target-size | 33554432 (32 MB) | Target size for splits when reading metadata tables. |
| read.split.open-file-cost | 4194304 (4 MB) | Estimated overhead for opening a new file, added to file sizes when bin-packing files into splits. |
| read.split.planning-lookback | 10 | Number of splits to look back when combining small files into a single split. |
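How the open-file cost feeds into split planning can be sketched as a simple bin-packing loop (illustrative; it assumes each file is weighted at no less than the open-file cost, a simplification of Iceberg's planner):

```python
def plan_splits(file_sizes, target_size=128 * 1024 * 1024,
                open_file_cost=4 * 1024 * 1024):
    """Bin-pack files into splits. Weighting every file at no less than
    open_file_cost stops thousands of tiny files from collapsing into
    one enormous split (and one overloaded Spark task)."""
    splits, current, current_weight = [], [], 0
    for size in file_sizes:
        weight = max(size, open_file_cost)
        if current and current_weight + weight > target_size:
            splits.append(current)
            current, current_weight = [], 0
        current.append(size)
        current_weight += weight
    if current:
        splits.append(current)
    return splits

# 100 tiny 1 KB files still spread across several tasks instead of one:
splits = plan_splits([1024] * 100)
```

With a 4 MB open-file cost and a 128 MB target, at most 32 tiny files land in one split, so the 100 files above produce four splits rather than one.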

Key recommendations:

  • Reduce read.split.target-size to 64 MB for latency-sensitive queries. Smaller splits create more Spark tasks, which means more parallelism. This trades cluster resource usage for faster individual query times.

  • Increase read.split.target-size to 256-512 MB for large batch scans. Fewer, larger splits reduce task scheduling overhead and are more efficient for full-table analytics.

-- Set at table level (affects all readers of this table)
ALTER TABLE analytics.events
SET TBLPROPERTIES (
'read.split.target-size' = '67108864'
);

Or per read, as a scan option in Spark:

spark.read.option("split-size", "67108864").table("analytics.events")

Vectorized Reads

Spark can read Parquet files using vectorized (columnar batch) processing, which is significantly faster than row-based reads:

# Ensure vectorized reads are enabled (default in modern Spark)
spark.conf.set("spark.sql.iceberg.vectorization.enabled", "true")

Vectorized reads process data in columnar batches of 4,096 rows, enabling SIMD-like processing and avoiding per-row object creation overhead. This is particularly effective for analytical queries that aggregate or filter on a few columns from wide tables.

Data Locality and Caching

On cloud object stores (S3, GCS, ADLS), every file access involves a network round-trip. Spark configurations that reduce unnecessary round-trips make a significant difference:

# Enable S3 request range reads (avoid downloading entire files for projections)
spark.conf.set("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")

# Increase S3 connection pool for parallel reads
spark.conf.set("spark.hadoop.fs.s3a.connection.maximum", "200")

# Enable file metadata caching
spark.conf.set("spark.hadoop.fs.s3a.metadatastore.authoritative", "true")

Spark Write Configurations

Write-side configurations interact with Iceberg table properties to determine the final file layout.

Shuffle and Distribution

| Config | Default | Impact |
|---|---|---|
| spark.sql.shuffle.partitions | 200 | Number of Spark partitions after a shuffle. Directly affects the number of write tasks. |
| spark.sql.adaptive.enabled | true (Spark 3.2+) | Enables Adaptive Query Execution (AQE), which dynamically adjusts partition counts. |
| spark.sql.adaptive.coalescePartitions.enabled | true | AQE merges small shuffle partitions into larger ones. |
| spark.sql.adaptive.advisoryPartitionSizeInBytes | 67108864 (64 MB) | AQE target partition size when coalescing. |

How these interact with Iceberg's write.distribution-mode:

When write.distribution-mode = hash, Iceberg adds a shuffle before the write. The number of post-shuffle partitions is controlled by spark.sql.shuffle.partitions (when AQE is off) or dynamically by AQE (when enabled).

Recommendation: Enable AQE and set the advisory partition size to match your target file size:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "268435456") # 256 MB

This tells AQE to coalesce small post-shuffle partitions until each is approximately 256 MB — aligning with a typical write.target-file-size-bytes of 256 MB. The result: each Spark task writes approximately one well-sized file.

Fanout Writers

For tables with many partitions and where you want to avoid the cost of a full shuffle, Spark supports fanout writers:

ALTER TABLE analytics.events
SET TBLPROPERTIES (
'write.spark.fanout.enabled' = 'true'
);

With fanout enabled and write.distribution-mode = none, each Spark task keeps one file writer open per partition it encounters. This avoids shuffle but requires enough memory to hold all open writers simultaneously.

| Approach | Shuffle Cost | Memory Usage | File Count |
|---|---|---|---|
| distribution-mode=hash | Yes (full shuffle) | Low (one writer per task) | Optimal — one file per task per partition |
| distribution-mode=none + fanout=true | None | High (one writer per partition per task) | Optimal — one file per task per partition |
| distribution-mode=none + fanout=false | None | Low | Poor — many small files |

Use fanout when: you have a moderate number of partitions (under 100) and want to avoid shuffle overhead. Avoid fanout when: you have thousands of partitions — the memory for thousands of open file writers per task can cause OOM errors.
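A back-of-the-envelope way to see the memory risk (the per-writer buffer size is an assumption; Parquet writers typically buffer on the order of a row group before flushing):

```python
def fanout_memory_mb(partitions_per_task: int, writer_buffer_mb: int = 64) -> int:
    """Rough per-task memory for fanout writes: one open writer,
    with its in-flight buffer, per partition the task touches."""
    return partitions_per_task * writer_buffer_mb

moderate = fanout_memory_mb(50)    # 3,200 MB per task: feasible on large executors
extreme = fanout_memory_mb(2000)   # 128,000 MB per task: guaranteed OOM
```

The estimate scales linearly with partition count, which is why fanout is safe for tens of partitions and dangerous for thousands.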

Write Parallelism

# Control the number of concurrent file writes
spark.conf.set("spark.sql.iceberg.write.max-open-file-cost", "4194304")

# For Structured Streaming writes, commit frequency follows the trigger:
# each micro-batch commits one new Iceberg snapshot
query = (df.writeStream
    .format("iceberg")
    .outputMode("append")
    .trigger(processingTime="60 seconds")
    .option("checkpointLocation", "s3://bucket/checkpoints/events")
    .toTable("analytics.events"))

For streaming workloads with Structured Streaming, each micro-batch commit creates a new Iceberg snapshot, so the trigger interval controls how frequently snapshots appear. Shorter intervals mean fresher data but more snapshots to manage. A 60-second trigger is a good default.

Data Layout Optimization

Even with the right table properties and Spark configurations, data that has accumulated over time may have suboptimal layout. Iceberg's rewrite_data_files procedure lets you reorganize data without affecting concurrent readers.

Compaction Strategies

| Strategy | What It Does | When to Use |
|---|---|---|
| binpack | Combines small files into larger ones without changing order. | After many small streaming commits. Fast. |
| sort | Rewrites all files sorted by the table's defined sort order. | When you need effective min/max statistics. Slower but higher quality. |
| sort with zorder | Rewrites files with multi-dimensional interleaving. | Multi-column filter workloads. |

-- Binpack: quick compaction of small files
CALL system.rewrite_data_files(
table => 'analytics.events',
strategy => 'binpack',
options => map(
'target-file-size-bytes', '268435456',
'min-file-size-bytes', '104857600',
'max-file-size-bytes', '536870912'
)
);

-- Sort: full rewrite with optimal data layout
CALL system.rewrite_data_files(
table => 'analytics.events',
strategy => 'sort',
sort_order => 'user_id ASC NULLS LAST, event_time ASC'
);

-- Z-order: multi-column clustering
CALL system.rewrite_data_files(
table => 'analytics.events',
strategy => 'sort',
sort_order => 'zorder(user_id, country)'
);

Partial Compaction with Filters

For large tables, you can compact only specific partitions:

-- Compact only the last 7 days
CALL system.rewrite_data_files(
table => 'analytics.events',
where => 'event_time >= TIMESTAMP ''2026-02-05 00:00:00'' AND event_time < TIMESTAMP ''2026-02-12 00:00:00'''
);

Manifest Compaction

Over time, the manifest list can accumulate many small manifests. Rewriting manifests keeps query planning fast:

CALL system.rewrite_manifests(
table => 'analytics.events'
);

This is particularly important after running rewrite_data_files, which can leave behind outdated manifest entries.

Putting It All Together: Performance Tuning Playbook

Here is a step-by-step checklist for tuning an existing Iceberg table:

Step 1: Identify the Query Patterns

What do your queries filter on? Group them:

| Pattern | Optimization |
|---|---|
| Always filter on time | Time-based partitioning (day, month) |
| Always filter on one ID column | Sort by that column + bloom filter |
| Filter on varying columns | Z-order across top 2-3 filter columns |
| Full table scans | Increase read parallelism, smaller split sizes |
| Point lookups on specific IDs | Bloom filters + sort order |

Step 2: Configure Metrics Collection

ALTER TABLE analytics.events
SET TBLPROPERTIES (
'write.metadata.metrics.default' = 'truncate(16)',
'write.metadata.metrics.column.user_id' = 'full',
'write.metadata.metrics.column.event_time' = 'full',
'write.metadata.metrics.column.payload' = 'none',
'write.metadata.metrics.column.raw_json' = 'none'
);

Step 3: Add Sort Order

ALTER TABLE analytics.events
WRITE ORDERED BY (user_id);

Step 4: Enable Bloom Filters on Lookup Columns

ALTER TABLE analytics.events
SET TBLPROPERTIES (
'write.parquet.bloom-filter-enabled.column.user_id' = 'true',
'write.parquet.bloom-filter-enabled.column.session_id' = 'true'
);

Step 5: Rewrite Data Files

CALL system.rewrite_data_files(
table => 'analytics.events',
strategy => 'sort'
);

Step 6: Rewrite Manifests

CALL system.rewrite_manifests(
table => 'analytics.events'
);

Step 7: Configure Spark Session

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "268435456")
spark.conf.set("spark.sql.iceberg.vectorization.enabled", "true")
spark.conf.set("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
spark.conf.set("spark.hadoop.fs.s3a.connection.maximum", "200")

Step 8: Measure and Iterate

After applying changes, measure the improvement:

-- Compare file counts before and after compaction
SELECT
COUNT(*) AS file_count,
ROUND(AVG(file_size_in_bytes) / 1048576, 1) AS avg_file_mb,
ROUND(SUM(file_size_in_bytes) / 1073741824.0, 2) AS total_gb
FROM analytics.events.files;

-- Check column statistics are being collected
SELECT
content,
file_path,
lower_bounds,
upper_bounds
FROM analytics.events.entries
LIMIT 5;

Quick Reference: Property Cheat Sheet

Table Properties for Read Performance

| Goal | Property | Recommended Value |
|---|---|---|
| Efficient file pruning on numbers | write.metadata.metrics.column.<col> | full |
| Efficient file pruning on strings | write.metadata.metrics.column.<col> | truncate(16) or truncate(32) |
| Disable stats for blob columns | write.metadata.metrics.column.<col> | none |
| Point lookup filtering | write.parquet.bloom-filter-enabled.column.<col> | true |
| Bloom filter size | write.parquet.bloom-filter-max-bytes | 1048576 (1 MB default) |
| Split size for parallelism | read.split.target-size | 67108864 (low-latency) to 268435456 (batch) |

Table Properties for Write Performance

| Goal | Property | Recommended Value |
|---|---|---|
| Prevent small files | write.distribution-mode | hash (partitioned) |
| Target file size | write.target-file-size-bytes | 268435456 (256 MB) |
| Modern compression | write.parquet.compression-codec | zstd |
| S3 write distribution | write.object-storage.enabled | true |
| Skip shuffle (few partitions) | write.spark.fanout.enabled | true |

Spark Session Properties

| Goal | Config | Recommended Value |
|---|---|---|
| Enable AQE | spark.sql.adaptive.enabled | true |
| AQE partition coalescing | spark.sql.adaptive.coalescePartitions.enabled | true |
| AQE target partition size | spark.sql.adaptive.advisoryPartitionSizeInBytes | 268435456 |
| Vectorized reads | spark.sql.iceberg.vectorization.enabled | true |
| S3 random reads | spark.hadoop.fs.s3a.experimental.input.fadvise | random |
| S3 connection pool | spark.hadoop.fs.s3a.connection.maximum | 200 |

Common Performance Anti-Patterns

  1. No sort order on tables used for point lookups. Without sorting, min/max statistics span the entire value range, making file pruning useless. Always add a sort order for columns frequently used in WHERE clauses.

  2. Bloom filters on low-cardinality columns. A bloom filter on a status column with 5 distinct values wastes space. Min/max statistics already handle this efficiently.

  3. Collecting full metrics on all columns. A table with 500 columns does not need statistics on every one. Set none for large, unfiltered columns to keep manifests compact.

  4. Never running compaction. File pruning only works well when files have narrow, non-overlapping value ranges. Without periodic compaction (sort or z-order), files accumulate with increasingly overlapping ranges.

  5. Setting read.split.target-size too small. Splits under 16 MB create excessive task scheduling overhead. The minimum practical value is 32 MB, and 64-256 MB is the sweet spot.

  6. Z-ordering more than 4 columns. Z-order effectiveness degrades exponentially with each additional column. Three columns is the practical maximum for most workloads.

  7. Ignoring manifest compaction. Thousands of small manifests slow down query planning. Run rewrite_manifests periodically, especially after large rewrite_data_files operations.


This post is part of our Apache Iceberg deep-dive series. For table creation guidance, see Iceberg Table Design: Properties, Partitioning, and Commit Best Practices. For file size management, see Mastering Iceberg File Sizes. Check out the full series on our blog.