Iceberg Bloom Filters with Spark: Configuration, Validation, and Performance Guide
When you query an Iceberg table with WHERE user_id = 'abc-123', Spark reads every Parquet file that could contain that value. It first checks partition pruning — does this file belong to the right partition? Then it checks column statistics — does the min/max range for user_id in this file include 'abc-123'? But for high-cardinality columns like UUIDs, user IDs, session IDs, or trace IDs, min/max statistics are nearly useless. The min might be 'aaa...' and the max might be 'zzz...', so every file passes the min/max check even though only one file actually contains the value.
This is where bloom filters come in. A bloom filter is a compact probabilistic data structure embedded in each Parquet file that can definitively say "this value is NOT in this file" — allowing Spark to skip the file entirely. For point lookups on high-cardinality columns, bloom filters can reduce I/O by 80-90%.
This post covers everything you need to know: how bloom filters work internally, when to use them, how to configure them on Iceberg tables, how to validate they are present in your Parquet files, and what false positives mean for your data correctness.
How Bloom Filters Work
A bloom filter is a bit array combined with a set of hash functions. When you insert a value, the hash functions map it to several positions in the bit array and set those bits to 1. When you check whether a value exists, the same hash functions are applied. If all the corresponding bits are 1, the value might exist (maybe). If any bit is 0, the value is definitely not present (certain).
This gives bloom filters two critical properties:
- No false negatives. If the bloom filter says "not present", the value is guaranteed to be absent. Your query will never miss data because of a bloom filter.
- Possible false positives. If the bloom filter says "maybe present", the value might or might not be there. Spark will read the row group to check — it just cannot skip it.
Parquet's Split Block Bloom Filter
Parquet uses a specific implementation called Split Block Bloom Filter (SBBF). Here is how it works:
- The filter is divided into blocks, each 256 bits (32 bytes).
- Each block contains eight 32-bit words.
- A filter consists of z blocks, where z depends on the configured maximum size.
- Parquet uses 8 hash functions (designed to fit cleanly into SIMD lanes for hardware acceleration).
- The hash function is xxHash (XXH64) with seed 0.
When a value is checked against the bloom filter:
- The value is hashed with XXH64 to produce a 64-bit hash.
- The upper 32 bits of the hash select which of the z blocks to check.
- Within that block, the 8 hash functions determine 8 bit positions.
- If all 8 bits are set, the value "maybe exists". If any bit is 0, the value "definitely does not exist".
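The procedure above can be sketched in Python. This is an illustrative model, not a byte-compatible implementation: the per-word salt constants below are the ones given in the Parquet specification, but the 64-bit hash is a stand-in (FNV-1a) rather than the XXH64 hash that real Parquet writers use.

```python
# Illustrative model of Parquet's split block bloom filter (SBBF).
# ASSUMPTION: toy_hash64 is a stand-in; real Parquet uses XXH64 with seed 0.

# Per-word salt constants from the Parquet bloom filter spec
SALT = [0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
        0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31]

def toy_hash64(value: bytes) -> int:
    # FNV-1a, 64-bit: a simple stand-in for XXH64
    h = 0xcbf29ce484222325
    for b in value:
        h = ((h ^ b) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

def block_mask(hash_lo32: int) -> list:
    # One bit per 32-bit word: multiply the low 32 hash bits by a salt,
    # keep the low 32 bits, and use the top 5 bits as a position (0-31)
    return [1 << (((hash_lo32 * salt) & 0xFFFFFFFF) >> 27) for salt in SALT]

class SplitBlockBloomFilter:
    def __init__(self, num_blocks: int):
        self.num_blocks = num_blocks
        # Each block is 256 bits: eight 32-bit words
        self.blocks = [[0] * 8 for _ in range(num_blocks)]

    def _block_index(self, h: int) -> int:
        # The upper 32 bits of the hash select the block
        return ((h >> 32) * self.num_blocks) >> 32

    def insert(self, value: bytes) -> None:
        h = toy_hash64(value)
        block = self.blocks[self._block_index(h)]
        for word, mask in enumerate(block_mask(h & 0xFFFFFFFF)):
            block[word] |= mask

    def might_contain(self, value: bytes) -> bool:
        h = toy_hash64(value)
        block = self.blocks[self._block_index(h)]
        return all(block[word] & mask
                   for word, mask in enumerate(block_mask(h & 0xFFFFFFFF)))
```

Note the asymmetry this structure guarantees: might_contain can return True for a value that was never inserted (a false positive), but once a value has been inserted it can never return False for that value.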
Where Bloom Filters are Stored
Bloom filter bitsets are stored inside the Parquet file, one per column per row group, and the file footer records each filter's offset and length. When Spark opens a Parquet file, it reads the footer (which is small, typically a few KB to a few MB), follows those offsets to the bloom filter data, and checks the filter before deciding whether to read the row group data.
This is extremely efficient: Spark loads roughly 1 MB of bloom filter data (the default maximum) to determine whether it needs to read potentially hundreds of MB of actual column data.
Will Bloom Filters Ever Miss Data?
No. Bloom filters will never cause your query to return incorrect or incomplete results.
This is the most important point to understand. A bloom filter can only produce false positives (saying "maybe present" when the value is absent), never false negatives (saying "not present" when the value is actually there).
What this means in practice:
| Bloom filter says | Actual state | What happens |
|---|---|---|
| "Not present" | Value is absent | Row group skipped (correct, saves I/O) |
| "Not present" | Value is present | Impossible — this never happens |
| "Maybe present" | Value is present | Row group is read (correct) |
| "Maybe present" | Value is absent | Row group is read unnecessarily (false positive, wastes some I/O) |
A false positive means Spark reads a row group it did not need to read — the same thing that would happen without a bloom filter at all. The worst case is that the bloom filter adds no benefit for a particular query, not that it causes wrong results.
False Positive Rate
The false positive probability (FPP) depends on:
- Filter size: More bytes = more bits = lower FPP.
- Number of distinct values inserted: More values = more bits set = higher FPP.
- Number of hash functions: Fixed at 8 in Parquet's SBBF.
A concrete example from the Parquet specification: a filter with 1,024 blocks (32 KB) that has 26,214 distinct values inserted will have a false positive probability of approximately 1.26%. This means that for values that are NOT in the file, 98.74% of the time the bloom filter will correctly identify them as absent and skip the row group.
Iceberg defaults to a maximum bloom filter size of 1 MB per column per row group, which provides a very low false positive rate for most real-world cardinalities.
When to Use Bloom Filters
Bloom filters shine in specific scenarios. They are not a universal optimization.
Ideal Use Cases
1. Point lookups on high-cardinality columns.
Columns like user_id, session_id, trace_id, request_id, order_id, or UUIDs have millions or billions of distinct values. Min/max statistics are useless for these columns because the range spans nearly the entire value space. Bloom filters provide the only effective file-level filtering for equality predicates on these columns.
-- Bloom filter helps: high-cardinality, equality predicate
SELECT * FROM events WHERE user_id = 'a1b2c3d4-e5f6-7890';
SELECT * FROM traces WHERE trace_id = '0af7651916cd43dd8448eb211c80319c';
SELECT * FROM orders WHERE order_id = 987654321;
2. Lookups where the value is often absent.
Bloom filters are most effective when most files do NOT contain the searched value — the filter skips those files entirely. For a table with 1,000 Parquet files where only 1 file contains the target user_id, bloom filters skip 999 files.
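The needle-in-a-haystack arithmetic is worth making explicit: with false positive probability p, a lookup over F files where only one contains the value reads about 1 + p(F - 1) files on average. A quick sketch, plugging in the ~1.26% FPP from the Parquet spec example discussed earlier:

```python
def expected_files_read(total_files: int, files_with_value: int, fpp: float) -> float:
    """Files Spark reads on average for a point lookup: every file that truly
    contains the value, plus false positive hits among the rest."""
    return files_with_value + fpp * (total_files - files_with_value)

# 1,000 files, value present in exactly 1, ~1.26% false positive rate
print(round(expected_files_read(1000, 1, 0.0126), 1))  # 13.6 -- not 1, but far from 1000
```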
3. Columns used in JOIN conditions (non-partitioned).
When Spark performs a sort-merge join or shuffled hash join, it reads row groups from both tables. Bloom filters on the join column can skip row groups that have no matching keys, reducing the data read during the join.
4. CDC/MERGE INTO target tables.
During a MERGE INTO, Spark joins the source against the target table. If the target table has bloom filters on the merge key, Spark can skip target row groups that have no matching source keys. This complements the push-down predicate optimization covered in our MERGE INTO blog.
5. Multi-tenant tables filtered by tenant_id.
If you store data for many tenants in a single table partitioned by date, a bloom filter on tenant_id helps skip row groups that do not contain the queried tenant — even within the same partition.
When NOT to Use Bloom Filters
1. Low-cardinality columns.
Columns like status (3 values), country (200 values), or is_active (2 values) have very few distinct values. Min/max statistics already work well for these — and a bloom filter would almost always say "maybe present" because every file contains most values. The filter just wastes space.
2. Columns already used in partition transforms.
If the table is partitioned by day(event_time) and you filter by event_time, partition pruning already eliminates files at the directory level. Adding a bloom filter on event_time provides minimal additional benefit.
3. Range queries.
Bloom filters only work with equality predicates (=, IN). Range queries (>, <, BETWEEN) cannot use bloom filters — they rely on min/max statistics and column indexes instead.
-- Bloom filter CANNOT help with range queries
SELECT * FROM events WHERE created_at > '2026-02-01';
SELECT * FROM orders WHERE amount BETWEEN 100 AND 500;
4. Columns with sorted data.
If data is sorted by a column (via Iceberg's write.sort-order), min/max statistics are already highly effective because each file contains a narrow, non-overlapping range. A bloom filter adds overhead without benefit.
Which SQL Predicates Work with Bloom Filters?
Not all predicates can leverage bloom filters. Here is the complete picture:
Equality (=) — Works
Single-column equality is the primary use case. The value is hashed and checked against the bloom filter directly.
-- Bloom filter kicks in
SELECT * FROM events WHERE user_id = 'abc-123';
IN Clause — Works
An IN clause is decomposed into multiple equality checks. Spark checks each value against the bloom filter. If the bloom filter says "not present" for every value in the list, the row group is skipped. If any value "maybe exists", the row group is read.
-- Bloom filter checks each value: 'abc', 'def', 'ghi'
SELECT * FROM events WHERE user_id IN ('abc', 'def', 'ghi');
This means IN with a small number of values works well. With a very large IN list (hundreds of values), the probability that at least one value triggers a "maybe present" increases, reducing the skip rate.
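The effect of list size can be quantified: if each absent value independently triggers a false positive with probability p, the chance that a row group containing none of the N values still cannot be skipped is 1 - (1 - p)^N. A quick sketch, again using the ~1.26% FPP from the spec example:

```python
def in_list_skip_failure(fpp: float, num_values: int) -> float:
    """Probability that a row group containing NONE of the IN-list values
    still reads as 'maybe present' for at least one of them."""
    return 1.0 - (1.0 - fpp) ** num_values

print(f"{in_list_skip_failure(0.0126, 3):.3f}")    # 0.037 -- small lists still skip well
print(f"{in_list_skip_failure(0.0126, 300):.3f}")  # 0.978 -- huge lists almost never skip
```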
AND Conditions — Works
When both sides of an AND have bloom-filter-enabled columns, each side is evaluated independently. If either bloom filter says "not present", the row group is skipped.
-- Both bloom filters are checked; either can skip the row group
SELECT * FROM events
WHERE user_id = 'abc-123'
AND session_id = 'sess-456';
OR Conditions — Limited (Known Issue)
This is a known limitation. When you use OR across different columns, Iceberg's row group filtering does not properly combine bloom filter results with dictionary and min/max evaluators (GitHub #10029).
The root cause: Iceberg evaluates bloom filters, dictionary filters, and min/max statistics as three independent evaluators on the full OR expression, then ANDs the results. If one evaluator cannot rule out a column (e.g., no bloom filter or no dictionary for one of the OR branches), it returns true, which overrides a bloom filter that definitively excluded the row group.
-- Bloom filter on user_id may NOT help here due to OR evaluation bug
SELECT * FROM events
WHERE user_id = 'abc-123'
OR event_type = 'purchase';
In the reported case, this caused a query to take 396 seconds instead of 11 seconds because row groups that could have been skipped were read unnecessarily.
Workaround: Rewrite OR queries as UNION ALL of two separate queries, each with a single equality predicate:
-- Workaround: split OR into UNION ALL so bloom filter works on each branch
SELECT * FROM events WHERE user_id = 'abc-123'
UNION ALL
SELECT * FROM events WHERE event_type = 'purchase'
AND user_id != 'abc-123'; -- avoid returning rows that match both branches twice
NOT Equal (!=, <>) — Does Not Work
Bloom filters can only test for set membership ("is this value present?"), not for exclusion.
-- Bloom filter CANNOT help
SELECT * FROM events WHERE user_id != 'abc-123';
Range Predicates (>, <, >=, <=, BETWEEN) — Does Not Work
Bloom filters are hash-based and have no concept of ordering. Range predicates rely on min/max statistics and column indexes instead.
IS NULL / IS NOT NULL — Does Not Work
Null testing uses separate Parquet null count statistics, not bloom filters.
Predicate Summary
| Predicate | Bloom filter used? | Notes |
|---|---|---|
| col = value | Yes | Primary use case |
| col IN (v1, v2, v3) | Yes | Each value checked; skip if all absent |
| col1 = v1 AND col2 = v2 | Yes | Both filters evaluated independently |
| col1 = v1 OR col2 = v2 | Limited | Known issue; may not skip row groups correctly |
| col != value | No | Cannot test exclusion |
| col > value | No | Use min/max stats instead |
| col BETWEEN v1 AND v2 | No | Use min/max stats instead |
| col IS NULL | No | Uses null count statistics |
| col LIKE 'pattern%' | No | Not an equality predicate |
How to Configure Bloom Filters on Iceberg Tables
Enable Per Column
Bloom filters must be enabled per column. There is no global "enable bloom filters on all columns" setting — this is intentional because bloom filters add file size overhead.
-- Enable bloom filters on specific columns
ALTER TABLE analytics.events SET TBLPROPERTIES (
'write.parquet.bloom-filter-enabled.column.user_id' = 'true',
'write.parquet.bloom-filter-enabled.column.session_id' = 'true',
'write.parquet.bloom-filter-enabled.column.trace_id' = 'true'
);
Or at table creation time:
CREATE TABLE analytics.events (
event_id BIGINT,
user_id STRING,
session_id STRING,
trace_id STRING,
event_time TIMESTAMP,
event_type STRING,
payload STRING
)
USING iceberg
PARTITIONED BY (day(event_time))
TBLPROPERTIES (
'write.parquet.bloom-filter-enabled.column.user_id' = 'true',
'write.parquet.bloom-filter-enabled.column.session_id' = 'true',
'write.parquet.bloom-filter-enabled.column.trace_id' = 'true'
);
Configure Filter Size
The maximum size of the bloom filter bitset controls the false positive rate. Larger filters = lower false positive rate but larger file sizes.
ALTER TABLE analytics.events SET TBLPROPERTIES (
-- Maximum bloom filter size per column per row group (default: 1 MB)
'write.parquet.bloom-filter-max-bytes' = '1048576'
);
Sizing guidance. A rule of thumb for Parquet's 8-hash SBBF: allocate roughly 10 bits (about 1.25 bytes) per distinct value per row group for a ~1% false positive rate. The spec example above (32 KB filter, 26,214 values, 1.26% FPP) is exactly this ratio.
| Distinct values per row group | Recommended max-bytes | Approximate FPP |
|---|---|---|
| < 100,000 | 128 KB (131072) | < 1% |
| 100,000 - 400,000 | 512 KB (524288) | < 1% |
| 400,000 - 800,000 | 1 MB (default) | < 1% |
| > 800,000 | 2-4 MB | ~1% when sized at ~10 bits per value |
For most workloads, the 1 MB default provides an excellent balance between file size overhead and filter accuracy.
Configure False Positive Probability (Per Column)
Iceberg allows setting a target false positive probability per column:
ALTER TABLE analytics.events SET TBLPROPERTIES (
'write.parquet.bloom-filter-fpp.column.user_id' = '0.01',
'write.parquet.bloom-filter-fpp.column.trace_id' = '0.001'
);
Important implementation note: In the current Iceberg Parquet writer, the bloom-filter-max-bytes setting takes precedence. The writer allocates exactly bloom-filter-max-bytes for the filter and does not dynamically size the filter based on FPP and NDV (number of distinct values). This means the FPP property serves as a hint but the actual false positive rate is determined by the fixed filter size and the number of distinct values inserted.
Full Configuration Reference
| Property | Default | Description |
|---|---|---|
| write.parquet.bloom-filter-enabled.column.<col> | false | Enable bloom filter for a specific column |
| write.parquet.bloom-filter-max-bytes | 1048576 (1 MB) | Maximum bytes for the bloom filter bitset per column per row group |
| write.parquet.bloom-filter-fpp.column.<col> | 0.01 | Target false positive probability (hint; actual FPP depends on filter size and NDV) |
How Spark Uses Bloom Filters at Read Time
Understanding the read path helps you reason about when bloom filters will help:
The Three-Layer Filtering Pipeline
When Spark reads an Iceberg table with a WHERE clause, filtering happens in three layers:
Layer 1: Partition pruning. Iceberg's metadata eliminates entire partitions that cannot match the predicate. This is the coarsest filter — it operates at the file/partition level.
Layer 2: Min/max statistics (column statistics). For each data file, Iceberg stores the min and max value of each column. Files where the predicate value falls outside the min/max range are skipped.
Layer 3: Bloom filters (row group level). For files that passed layers 1 and 2, Spark opens the Parquet file and checks the bloom filter in each row group's footer. Row groups where the bloom filter says "not present" are skipped.
1000 files in table
→ Partition pruning: 30 files remain
→ Min/max statistics: 15 files remain
→ Bloom filter: 1 file actually read
For high-cardinality columns, min/max statistics eliminate very few files (the range is too wide). Bloom filters fill this gap by providing precise set-membership testing.
What Spark Reads
- Spark reads the Parquet file footer (a few KB at the end of the file).
- The footer contains the bloom filter offset and length for each column in each row group.
- Spark reads the bloom filter data (up to 1 MB per column) from the specified offset.
- Spark hashes the filter value and checks the bloom filter.
- If the bloom filter says "not present", the entire row group is skipped — Spark does not read any column data from that row group.
This means the I/O cost of checking a bloom filter is: footer read + bloom filter read ≈ 1-2 MB per file, versus potentially hundreds of MB of column data if the row group is not skipped.
Performance Impact
Benchmarks
Testing from CERN's database team with Spark 3.5 and Parquet 1.13.1 on a 1-million-row dataset showed:
| Metric | Without bloom filter | With bloom filter | Improvement |
|---|---|---|---|
| Bytes read (absent value) | 8,299,656 | 1,091,611 | 87% reduction (7.6x) |
| File size | 8,010,077 | 10,107,275 | +26% overhead |
Key observations:
- For values NOT in the file, bloom filters reduced I/O by 87% — Spark skipped the row group entirely after checking the filter.
- File size increased by 26% due to the bloom filter data stored in the footer. For production tables with the 1 MB default max-bytes, the overhead is typically 1-5% of total file size for 128-256 MB files.
- For values that ARE in the file, there is no improvement — Spark reads the row group regardless. The bloom filter check adds negligible overhead (microseconds to check a bit array).
Real-World Performance Expectations
The actual speedup depends on your data distribution:
| Scenario | Expected improvement |
|---|---|
| Point lookup, value in 1 of 1000 files | ~1000x fewer files read |
| Point lookup, value in 1 of 100 files | ~100x fewer files read |
| IN clause with 10 values across 1000 files | ~10-100x fewer files read |
| Point lookup, value in most files | No improvement (files containing the value cannot be skipped) |
| Range query (BETWEEN, >, <) | No improvement (bloom filter not applicable) |
The biggest wins come when you are searching for a needle in a haystack — one specific value across thousands of files.
How to Validate Bloom Filters in Parquet Files
After enabling bloom filters, you need to verify they are actually present in the written Parquet files. Here are four methods.
Method 1: parquet-cli (Recommended)
The Apache Parquet CLI tool can inspect bloom filter metadata directly.
Install parquet-cli:
# Download the parquet-cli JAR (adjust version as needed)
wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-cli/1.14.4/parquet-cli-1.14.4-runtime.jar
Check footer for bloom filter offsets:
java -jar parquet-cli-1.14.4-runtime.jar footer /path/to/datafile.parquet
The output will show bloom filter offset and length for each column in each row group. If a column has a bloom filter, you will see non-null offset and length values.
Test a specific value against the bloom filter:
# Check if a value exists in the bloom filter for column 'user_id'
java -jar parquet-cli-1.14.4-runtime.jar bloom-filter \
-c user_id \
-v "a1b2c3d4-e5f6-7890" \
/path/to/datafile.parquet
The output will be one of:
- column user_id has no bloom filter: a bloom filter is NOT configured for this column.
- value a1b2c3d4-e5f6-7890 NOT exists: the value is definitively absent from this row group.
- value a1b2c3d4-e5f6-7890 maybe exists: the value might be present (a true positive or a false positive).
Method 2: PySpark — Read Parquet Metadata
You can inspect Parquet file metadata programmatically using PySpark with the pyarrow library:
import pyarrow.parquet as pq
# Read the Parquet file metadata
parquet_file = pq.ParquetFile("/path/to/datafile.parquet")
metadata = parquet_file.metadata
print(f"Number of row groups: {metadata.num_row_groups}")
for rg_idx in range(metadata.num_row_groups):
rg = metadata.row_group(rg_idx)
print(f"\nRow group {rg_idx}: {rg.num_rows} rows")
for col_idx in range(rg.num_columns):
col = rg.column(col_idx)
print(f" Column: {col.path_in_schema}")
print(f" Bloom filter offset: {col.bloom_filter_offset}")
print(f" Bloom filter length: {col.bloom_filter_length}")
If bloom_filter_offset is not None (or not -1), a bloom filter exists for that column in that row group.
Method 3: Spark SQL — Check File Size Difference
A quick heuristic: write the same data with and without bloom filters and compare file sizes.
# Write without bloom filter
spark.sql("""
CREATE TABLE test.no_bloom (user_id STRING, data STRING)
USING iceberg
""")
spark.sql("INSERT INTO test.no_bloom SELECT * FROM source_data")
# Write with bloom filter
spark.sql("""
CREATE TABLE test.with_bloom (user_id STRING, data STRING)
USING iceberg
TBLPROPERTIES ('write.parquet.bloom-filter-enabled.column.user_id' = 'true')
""")
spark.sql("INSERT INTO test.with_bloom SELECT * FROM source_data")
# Compare file sizes on S3
aws s3 ls --recursive s3://bucket/warehouse/test/no_bloom/data/ --summarize
aws s3 ls --recursive s3://bucket/warehouse/test/with_bloom/data/ --summarize
Files with bloom filters will be larger. The overhead depends on bloom-filter-max-bytes and the number of bloom-filter-enabled columns. Typically 1-5% larger for 128-256 MB files with 1 MB bloom filters.
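When the warehouse lives on local disk (or is mounted locally), the same size comparison can be done with a few lines of Python instead of the AWS CLI. The paths in the commented usage are placeholders:

```python
import os

def total_size(path: str) -> int:
    """Recursively sum the sizes of all files under a directory tree."""
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _dirs, files in os.walk(path)
        for name in files
    )

# Example usage (placeholder paths for a locally mounted warehouse):
# no_bloom = total_size("/warehouse/test/no_bloom/data")
# with_bloom = total_size("/warehouse/test/with_bloom/data")
# print(f"bloom filter overhead: {with_bloom / no_bloom - 1:.1%}")
```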
Method 4: Spark EXPLAIN — Check Row Group Filtering
After enabling bloom filters, verify they are being used at query time:
# Enable verbose Spark metrics
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
# Run a point lookup and check the metrics
df = spark.sql("""
SELECT * FROM analytics.events
WHERE user_id = 'a1b2c3d4-e5f6-7890'
""")
df.collect()
In Spark UI, check the SQL tab for your query. Look at the "number of output rows" from the Parquet scan — with bloom filters, this should be significantly lower than without, because fewer row groups are read.
Bloom Filters and Existing Data
Important: Enabling bloom filters on an existing table only affects newly written files. Existing Parquet files will NOT retroactively get bloom filters.
To add bloom filters to existing data, you must rewrite the files:
-- Enable bloom filters
ALTER TABLE analytics.events SET TBLPROPERTIES (
'write.parquet.bloom-filter-enabled.column.user_id' = 'true'
);
-- Rewrite all data files to include bloom filters
CALL spark_catalog.system.rewrite_data_files(
table => 'analytics.events'
);
The rewrite_data_files procedure reads all existing Parquet files, applies the current table properties (including bloom filter settings), and writes new files with bloom filters included. This is a one-time operation — all subsequent writes will automatically include bloom filters.
Bloom Filters with Other Optimizations
Bloom filters work best when combined with other Iceberg optimizations:
Bloom Filters + Sort Order
If you sort data by a column, min/max statistics become highly effective for that column — each file has a narrow range. Bloom filters are then unnecessary for the sorted column. Use bloom filters for the non-sorted high-cardinality columns:
CREATE TABLE analytics.events (
event_id BIGINT,
user_id STRING,
event_time TIMESTAMP,
event_type STRING
)
USING iceberg
PARTITIONED BY (day(event_time))
TBLPROPERTIES (
-- Sort by event_type: min/max statistics will be effective
'write.sort-order' = 'event_type ASC',
-- Bloom filter on user_id: min/max statistics are NOT effective
'write.parquet.bloom-filter-enabled.column.user_id' = 'true'
);
Bloom Filters + Bucket Partitioning
For tables bucketed by a column (e.g., bucket(64, user_id)), the bucket partition already limits files per query. Bloom filters provide additional row-group-level filtering within each bucket:
CREATE TABLE analytics.events (
event_id BIGINT,
user_id STRING,
event_time TIMESTAMP
)
USING iceberg
PARTITIONED BY (bucket(64, user_id), day(event_time))
TBLPROPERTIES (
'write.parquet.bloom-filter-enabled.column.user_id' = 'true'
);
Here, bucketing reduces the search to ~1/64th of files, and the bloom filter further reduces the row groups read within those files.
Bloom Filters + Column Metrics
Ensure column statistics are enabled for the bloom-filter-enabled columns so that min/max filtering works alongside bloom filters:
ALTER TABLE analytics.events SET TBLPROPERTIES (
'write.metadata.metrics.column.user_id' = 'full',
'write.parquet.bloom-filter-enabled.column.user_id' = 'true'
);
Production Checklist
When to enable:
- Column has high cardinality (>10,000 distinct values per file)
- Queries use equality predicates (=, IN) on this column
- Column is NOT the primary sort key (sort order already handles it)
- Column is NOT the partition key (partition pruning already handles it)
Configuration:
- Enable per column: write.parquet.bloom-filter-enabled.column.<col> = true
- Keep bloom-filter-max-bytes at the 1 MB default unless profiling shows a need to change it
- Limit to 2-3 columns per table; each can add up to ~1 MB per row group to every Parquet file
Validation:
- After the first write, use parquet-cli bloom-filter -c <col> -v <value> <file> to confirm the filter exists
- Check file sizes before and after; expect a 1-5% increase for 128-256 MB files
- Run a point lookup and check the Spark UI for reduced row group reads
Maintenance:
- After enabling, run rewrite_data_files to backfill bloom filters on existing data
- Monitor file size growth; if bloom filters are significantly inflating files, reduce bloom-filter-max-bytes
- Regularly compact tables; many small files reduce bloom filter effectiveness because each file has fewer distinct values per row group
How Cazpian Handles This
On Cazpian, bloom filter configuration is available as a table-level setting in the managed catalog. When you enable bloom filters on a column, Cazpian automatically schedules a rewrite_data_files operation to backfill existing data. Cazpian's query optimizer also surfaces bloom filter skip metrics in the query history dashboard, so you can see exactly how many row groups were skipped per query.
What's Next
This post covered bloom filters — the third layer of Iceberg's query filtering pipeline. For the full picture, see our other posts in this series:
- Iceberg Query Performance Tuning — partition pruning, file-level min/max statistics, and Spark read configs.
- Iceberg Table Design — how to choose partition transforms and write properties.
- Storage Partitioned Joins — eliminating shuffle operations with bucket partitioning.
- Iceberg on AWS: S3FileIO — S3FileIO, ObjectStoreLocationProvider, and avoiding S3 throttling.