
Iceberg Bloom Filters with Spark: Configuration, Validation, and Performance Guide

21 min read
Cazpian Engineering
Platform Engineering Team


When you query an Iceberg table with WHERE user_id = 'abc-123', Spark reads every Parquet file that could contain that value. It first checks partition pruning — does this file belong to the right partition? Then it checks column statistics — does the min/max range for user_id in this file include 'abc-123'? But for high-cardinality columns like UUIDs, user IDs, session IDs, or trace IDs, min/max statistics are nearly useless. The min might be 'aaa...' and the max might be 'zzz...', so every file passes the min/max check even though only one file actually contains the value.

This is where bloom filters come in. A bloom filter is a compact probabilistic data structure embedded in each Parquet file that can definitively say "this value is NOT in this file" — allowing Spark to skip the file entirely. For point lookups on high-cardinality columns, bloom filters can reduce I/O by 80-90%.

This post covers everything you need to know: how bloom filters work internally, when to use them, how to configure them on Iceberg tables, how to validate they are present in your Parquet files, and what false positives mean for your data correctness.

How Bloom Filters Work

A bloom filter is a bit array combined with a set of hash functions. When you insert a value, the hash functions map it to several positions in the bit array and set those bits to 1. When you check whether a value exists, the same hash functions are applied. If all the corresponding bits are 1, the value might exist (maybe). If any bit is 0, the value is definitely not present (certain).

[Figure: Iceberg bloom filter internals, showing the Split Block Bloom Filter structure, per-column configuration, row group skipping mechanics, and false positive rate behavior]

This gives bloom filters two critical properties:

  1. No false negatives. If the bloom filter says "not present", the value is guaranteed to be absent. Your query will never miss data because of a bloom filter.
  2. Possible false positives. If the bloom filter says "maybe present", the value might or might not be there. Spark will read the row group to check — it just cannot skip it.
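These two properties fall directly out of the mechanics. Here is a minimal toy bloom filter in Python for intuition only — this is not Parquet's actual implementation (Parquet uses a Split Block Bloom Filter, described in the next section), and the hash choice here is arbitrary:

```python
# Toy bloom filter illustrating "no false negatives, possible false positives".
# Minimal sketch for intuition -- NOT Parquet's actual implementation.
import hashlib

class ToyBloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 8):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array stored as a Python int

    def _positions(self, value: str):
        # Derive k positions by slicing one strong hash into 4-byte chunks.
        digest = hashlib.sha256(value.encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[4 * i:4 * i + 4], "little")
            yield chunk % self.num_bits

    def insert(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # True  -> "maybe present" (could be a false positive)
        # False -> "definitely not present" (never a false negative)
        return all(self.bits >> pos & 1 for pos in self._positions(value))

bf = ToyBloomFilter()
for uid in ("abc-123", "def-456", "ghi-789"):
    bf.insert(uid)

print(bf.might_contain("abc-123"))  # inserted values always report "maybe"
print(bf.might_contain("zzz-999"))  # absent values almost always report False
```

An inserted value always sets its bits, so the membership check can never return False for it — that is why a bloom filter can never cause a query to miss data.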

Parquet's Split Block Bloom Filter

Parquet uses a specific implementation called Split Block Bloom Filter (SBBF). Here is how it works:

  • The filter is divided into blocks, each 256 bits (32 bytes).
  • Each block contains eight 32-bit words.
  • A filter consists of z blocks, where z depends on the configured maximum size.
  • A single hash function, xxHash (XXH64) with seed 0, produces one 64-bit hash per value.
  • Within a block, 8 fixed salt constants derive one bit position per 32-bit word, a layout designed to fit cleanly into SIMD lanes for hardware acceleration.

When a value is checked against the bloom filter:

  1. The value is hashed with XXH64 to produce a 64-bit hash.
  2. The upper 32 bits of the hash select which of the z blocks to check.
  3. Within that block, the lower 32 bits of the hash, multiplied by each of the 8 salt constants, determine 8 bit positions (one per 32-bit word).
  4. If all 8 bits are set, the value "maybe exists". If any bit is 0, the value "definitely does not exist".
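The steps above can be sketched in Python. The salt constants below are the ones given in the Parquet bloom filter specification; the hash is a stand-in for XXH64 (which needs a third-party library), so bit patterns will not match real Parquet files, but the index math is the same:

```python
# Sketch of Split Block Bloom Filter block/bit-position math.
# Salt constants are from the Parquet bloom filter specification; the hash
# below is a stand-in for XXH64(seed=0), used only to illustrate the math.
import hashlib

SALT = [0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
        0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31]

def hash64(value: bytes) -> int:
    # Stand-in for XXH64: any well-mixed 64-bit hash demonstrates the scheme.
    return int.from_bytes(hashlib.sha256(value).digest()[:8], "little")

class SBBF:
    def __init__(self, num_blocks: int):
        self.z = num_blocks
        # Each block: eight 32-bit words = 256 bits.
        self.blocks = [[0] * 8 for _ in range(num_blocks)]

    def _locate(self, h: int):
        block = ((h >> 32) * self.z) >> 32   # upper 32 bits pick the block
        key = h & 0xFFFFFFFF                 # lower 32 bits used within it
        # One bit per word: top 5 bits of (key * salt) give a position 0..31.
        bits = [((key * s) & 0xFFFFFFFF) >> 27 for s in SALT]
        return block, bits

    def insert(self, value: bytes) -> None:
        block, bits = self._locate(hash64(value))
        for word, bit in enumerate(bits):
            self.blocks[block][word] |= 1 << bit

    def might_contain(self, value: bytes) -> bool:
        block, bits = self._locate(hash64(value))
        return all(self.blocks[block][word] >> bit & 1
                   for word, bit in enumerate(bits))

f = SBBF(num_blocks=32)
f.insert(b"a1b2c3d4-e5f6-7890")
print(f.might_contain(b"a1b2c3d4-e5f6-7890"))  # True: "maybe exists"
```

Because all 8 probe bits for a value live in one 256-bit block, a lookup touches a single cache line, which is what makes the check effectively free at query time.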

Where Bloom Filters are Stored

Bloom filter bitsets are written into the Parquet file body, and the file footer metadata records each filter's offset and length — one bloom filter per column per row group. When Spark opens a Parquet file, it reads the footer (which is small — typically a few KB to a few MB), follows the recorded offsets to the bloom filter data, and checks the filter before deciding whether to read the row group data.

This is extremely efficient: Spark loads roughly 1 MB of bloom filter data (the default maximum) to determine whether it needs to read potentially hundreds of MB of actual column data.

Will Bloom Filters Ever Miss Data?

No. Bloom filters will never cause your query to return incorrect or incomplete results.

This is the most important point to understand. A bloom filter can only produce false positives (saying "maybe present" when the value is absent), never false negatives (saying "not present" when the value is actually there).

What this means in practice:

| Bloom filter says | Actual state | What happens |
|---|---|---|
| "Not present" | Value is absent | Row group skipped (correct, saves I/O) |
| "Not present" | Value is present | Impossible — this never happens |
| "Maybe present" | Value is present | Row group is read (correct) |
| "Maybe present" | Value is absent | Row group is read unnecessarily (false positive, wastes some I/O) |

A false positive means Spark reads a row group it did not need to read — the same thing that would happen without a bloom filter at all. The worst case is that the bloom filter adds no benefit for a particular query, not that it causes wrong results.

False Positive Rate

The false positive probability (FPP) depends on:

  • Filter size: More bytes = more bits = lower FPP.
  • Number of distinct values inserted: More values = more bits set = higher FPP.
  • Bits set per value: fixed at 8 in Parquet's SBBF (one bit per 32-bit word of the probed block).

A concrete example from the Parquet specification: a filter with 1,024 blocks (32 KB) that has 26,214 distinct values inserted will have a false positive probability of approximately 1.26%. This means that for values that are NOT in the file, 98.74% of the time the bloom filter will correctly identify them as absent and skip the row group.

Iceberg defaults to a maximum bloom filter size of 1 MB per column per row group, which keeps the false positive rate low (roughly 2% or better) for row groups with up to about a million distinct values — comfortably covering most real-world row group cardinalities.
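As a rough sizing aid, the classic bloom filter formula approximates FPP from the bit count m, inserted values n, and k = 8 bits per value. The SBBF's actual rate is slightly higher because all 8 bits are confined to one block (the spec's example lands at 1.26% where the classic formula gives about 0.85%), but it is close enough for sizing. This helper is illustrative, not an Iceberg API:

```python
# Classic bloom filter FPP approximation; actual SBBF FPP is slightly higher.
import math

def approx_fpp(filter_bytes: int, distinct_values: int, k: int = 8) -> float:
    """Approximate false positive probability: (1 - e^(-k*n/m))^k."""
    m = filter_bytes * 8  # filter size in bits
    return (1 - math.exp(-k * distinct_values / m)) ** k

# The spec's example filter: 1,024 blocks = 32 KB, 26,214 distinct values.
print(f"{approx_fpp(32 * 1024, 26_214):.4f}")     # ~0.008 (spec quotes ~1.26% for SBBF)

# The Iceberg default: 1 MB filter, 1 million distinct values per row group.
print(f"{approx_fpp(1_048_576, 1_000_000):.4f}")  # ~0.02
```

Rule of thumb from the formula: about 10 bits (1.25 bytes) per distinct value gives roughly 1% FPP; 20 bits per value pushes it down toward 0.01%.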

When to Use Bloom Filters

Bloom filters shine in specific scenarios. They are not a universal optimization.

Ideal Use Cases

1. Point lookups on high-cardinality columns.

Columns like user_id, session_id, trace_id, request_id, order_id, or UUIDs have millions or billions of distinct values. Min/max statistics are useless for these columns because the range spans nearly the entire value space. Bloom filters provide the only effective file-level filtering for equality predicates on these columns.

-- Bloom filter helps: high-cardinality, equality predicate
SELECT * FROM events WHERE user_id = 'a1b2c3d4-e5f6-7890';
SELECT * FROM traces WHERE trace_id = '0af7651916cd43dd8448eb211c80319c';
SELECT * FROM orders WHERE order_id = 987654321;

2. Lookups where the value is often absent.

Bloom filters are most effective when most files do NOT contain the searched value — the filter skips those files entirely. For a table with 1,000 Parquet files where only 1 file contains the target user_id, bloom filters skip 999 files.
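The benefit can be quantified with back-of-envelope expectation arithmetic (a simple model, not a measurement): with N candidate files and per-file false positive rate p, a value present in exactly one file costs an expected 1 + p·(N − 1) file reads.

```python
# Expected files read for a point lookup when the value lives in 1 of N files
# and the bloom filter has false positive probability p per absent file.
def expected_files_read(n_files: int, fpp: float) -> float:
    return 1 + fpp * (n_files - 1)

print(expected_files_read(1000, 0.01))   # ~11 files instead of 1000
print(expected_files_read(1000, 0.001))  # ~2 files
```

Even a 1% false positive rate turns a 1,000-file scan into roughly 11 file reads.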

3. Columns used in JOIN conditions (non-partitioned).

When Spark performs a sort-merge join or shuffled hash join, it reads row groups from both tables. Bloom filters on the join column can skip row groups that have no matching keys, reducing the data read during the join.

4. CDC/MERGE INTO target tables.

During a MERGE INTO, Spark joins the source against the target table. If the target table has bloom filters on the merge key, Spark can skip target row groups that have no matching source keys. This complements the push-down predicate optimization covered in our MERGE INTO blog.

5. Multi-tenant tables filtered by tenant_id.

If you store data for many tenants in a single table partitioned by date, a bloom filter on tenant_id helps skip row groups that do not contain the queried tenant — even within the same partition.

When NOT to Use Bloom Filters

1. Low-cardinality columns.

Columns like status (3 values), country (200 values), or is_active (2 values) have very few distinct values. Min/max statistics already work well for these — and a bloom filter would almost always say "maybe present" because every file contains most values. The filter just wastes space.

2. Columns already used in partition transforms.

If the table is partitioned by day(event_time) and you filter by event_time, partition pruning already eliminates files at the directory level. Adding a bloom filter on event_time provides minimal additional benefit.

3. Range queries.

Bloom filters only work with equality predicates (=, IN). Range queries (>, <, BETWEEN) cannot use bloom filters — they rely on min/max statistics and column indexes instead.

-- Bloom filter CANNOT help with range queries
SELECT * FROM events WHERE created_at > '2026-02-01';
SELECT * FROM orders WHERE amount BETWEEN 100 AND 500;

4. Columns with sorted data.

If data is sorted by a column (via Iceberg's write.sort-order), min/max statistics are already highly effective because each file contains a narrow, non-overlapping range. A bloom filter adds overhead without benefit.

Which SQL Predicates Work with Bloom Filters?

Not all predicates can leverage bloom filters. Here is the complete picture:

Equality (=) — Works

Single-column equality is the primary use case. The value is hashed and checked against the bloom filter directly.

-- Bloom filter kicks in
SELECT * FROM events WHERE user_id = 'abc-123';

IN Clause — Works

An IN clause is decomposed into multiple equality checks. Spark checks each value against the bloom filter. If the bloom filter says "not present" for every value in the list, the row group is skipped. If any value "maybe exists", the row group is read.

-- Bloom filter checks each value: 'abc', 'def', 'ghi'
SELECT * FROM events WHERE user_id IN ('abc', 'def', 'ghi');

This means IN with a small number of values works well. With a very large IN list (hundreds of values), the probability that at least one value triggers a "maybe present" increases, reducing the skip rate.
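The falloff is easy to model (assuming independent lookups, which is an approximation): a row group with none of the listed values is skipped only if every lookup returns "not present", so the skip probability is (1 − p)^n for an n-value list with false positive rate p.

```python
# Probability a row group containing NONE of the IN-list values is skipped,
# assuming independent lookups with false positive rate p per value.
def skip_probability(n_values: int, fpp: float) -> float:
    return (1 - fpp) ** n_values

for n in (3, 50, 500):
    print(n, round(skip_probability(n, 0.01), 3))
```

At p = 1%, a 3-value list skips ~97% of non-matching row groups, a 50-value list ~60%, and a 500-value list under 1% — at which point the bloom filter contributes almost nothing.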

AND Conditions — Works

When both sides of an AND have bloom-filter-enabled columns, each side is evaluated independently. If either bloom filter says "not present", the row group is skipped.

-- Both bloom filters are checked; either can skip the row group
SELECT * FROM events
WHERE user_id = 'abc-123'
AND session_id = 'sess-456';

OR Conditions — Limited (Known Issue)

This is a known limitation. When you use OR across different columns, Iceberg's row group filtering does not properly combine bloom filter results with dictionary and min/max evaluators (GitHub #10029).

The root cause: Iceberg evaluates bloom filters, dictionary filters, and min/max statistics as three independent evaluators on the full OR expression, then ANDs the results. If one evaluator cannot rule out a column (e.g., no bloom filter or no dictionary for one of the OR branches), it returns true, which overrides a bloom filter that definitively excluded the row group.
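A toy model of that root cause (my own illustrative sketch, not Iceberg's actual evaluator classes) shows why per-expression evaluation loses the skip:

```python
# Toy model of the OR-evaluation issue (illustrative only). Each evaluator
# judges the WHOLE predicate and must answer True ("might match") for any
# branch outside its knowledge.

def bloom_eval(pred):
    """Row-group evaluator that only knows user_id's bloom filter."""
    col, value = pred
    if col == "user_id" and value == "abc-123":
        return False  # bloom filter: definitely not in this row group
    return True       # no bloom filter for this column: cannot rule out

def eval_or(evaluator, branches):
    # An OR "might match" if any branch might match.
    return any(evaluator(b) for b in branches)

branches = [("user_id", "abc-123"), ("event_type", "purchase")]

# Against the full OR, the bloom filter can no longer exclude the row group,
# even though it definitively excluded the user_id branch:
print(eval_or(bloom_eval, branches))       # True -> row group is read

# Evaluated per branch (what the UNION ALL workaround achieves):
print(bloom_eval(("user_id", "abc-123")))  # False -> row group skipped
```

Splitting the OR into separate queries restores per-branch evaluation, which is exactly why the UNION ALL rewrite below recovers the skipping behavior.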

-- Bloom filter on user_id may NOT help here due to OR evaluation bug
SELECT * FROM events
WHERE user_id = 'abc-123'
OR event_type = 'purchase';

In the reported case, this caused a query to take 396 seconds instead of 11 seconds because row groups that could have been skipped were read unnecessarily.

Workaround: Rewrite OR queries as UNION ALL of two separate queries, each with a single equality predicate:

-- Workaround: split OR into UNION ALL so bloom filter works on each branch
SELECT * FROM events WHERE user_id = 'abc-123'
UNION ALL
SELECT * FROM events WHERE event_type = 'purchase'
AND user_id != 'abc-123'; -- deduplicate

NOT Equal (!=, <>) — Does Not Work

Bloom filters can only test for set membership ("is this value present?"), not for exclusion.

-- Bloom filter CANNOT help
SELECT * FROM events WHERE user_id != 'abc-123';

Range Predicates (>, <, >=, <=, BETWEEN) — Does Not Work

Bloom filters are hash-based and have no concept of ordering. Range predicates rely on min/max statistics and column indexes instead.

IS NULL / IS NOT NULL — Does Not Work

Null testing uses separate Parquet null count statistics, not bloom filters.

Predicate Summary

| Predicate | Bloom filter used? | Notes |
|---|---|---|
| col = value | Yes | Primary use case |
| col IN (v1, v2, v3) | Yes | Each value checked; skip if all absent |
| col1 = v1 AND col2 = v2 | Yes | Both filters evaluated independently |
| col1 = v1 OR col2 = v2 | Limited | Known issue — may not skip row groups correctly |
| col != value | No | Cannot test exclusion |
| col > value | No | Use min/max stats instead |
| col BETWEEN v1 AND v2 | No | Use min/max stats instead |
| col IS NULL | No | Uses null count statistics |
| col LIKE 'pattern%' | No | Not an equality predicate |

How to Configure Bloom Filters on Iceberg Tables

Enable Per Column

Bloom filters must be enabled per column. There is no global "enable bloom filters on all columns" setting — this is intentional because bloom filters add file size overhead.

-- Enable bloom filters on specific columns
ALTER TABLE analytics.events SET TBLPROPERTIES (
  'write.parquet.bloom-filter-enabled.column.user_id' = 'true',
  'write.parquet.bloom-filter-enabled.column.session_id' = 'true',
  'write.parquet.bloom-filter-enabled.column.trace_id' = 'true'
);

Or at table creation time:

CREATE TABLE analytics.events (
  event_id BIGINT,
  user_id STRING,
  session_id STRING,
  trace_id STRING,
  event_time TIMESTAMP,
  event_type STRING,
  payload STRING
)
USING iceberg
PARTITIONED BY (day(event_time))
TBLPROPERTIES (
  'write.parquet.bloom-filter-enabled.column.user_id' = 'true',
  'write.parquet.bloom-filter-enabled.column.session_id' = 'true',
  'write.parquet.bloom-filter-enabled.column.trace_id' = 'true'
);

Configure Filter Size

The maximum size of the bloom filter bitset controls the false positive rate. Larger filters = lower false positive rate but larger file sizes.

ALTER TABLE analytics.events SET TBLPROPERTIES (
  -- Maximum bloom filter size per column per row group (default: 1 MB)
  'write.parquet.bloom-filter-max-bytes' = '1048576'
);

Sizing guidance, anchored to the spec's data point of ~10 bits per distinct value ≈ 1% FPP:

| Distinct values per row group | Recommended max-bytes | Approximate FPP |
|---|---|---|
| < 100,000 | 128 KB (131072) | < 1% |
| 100,000 - 500,000 | 512 KB (524288) | < 2% |
| 500,000 - 1,000,000 | 1 MB (default) | ~0.1-2% |
| > 1,000,000 | ~1.25-2 bytes per distinct value | ~0.1-1% |

For row groups with up to about a million distinct values, the 1 MB default provides a good balance between file size overhead and filter accuracy; beyond that, scale the size with the cardinality.

Configure False Positive Probability (Per Column)

Iceberg allows setting a target false positive probability per column:

ALTER TABLE analytics.events SET TBLPROPERTIES (
  'write.parquet.bloom-filter-fpp.column.user_id' = '0.01',
  'write.parquet.bloom-filter-fpp.column.trace_id' = '0.001'
);

Important implementation note: In the current Iceberg Parquet writer, the bloom-filter-max-bytes setting takes precedence. The writer allocates exactly bloom-filter-max-bytes for the filter and does not dynamically size the filter based on FPP and NDV (number of distinct values). This means the FPP property serves as a hint but the actual false positive rate is determined by the fixed filter size and the number of distinct values inserted.

Full Configuration Reference

| Property | Default | Description |
|---|---|---|
| write.parquet.bloom-filter-enabled.column.<col> | false | Enable bloom filter for a specific column |
| write.parquet.bloom-filter-max-bytes | 1048576 (1 MB) | Maximum bytes for the bloom filter bitset per column per row group |
| write.parquet.bloom-filter-fpp.column.<col> | 0.01 | Target false positive probability (hint — actual FPP depends on filter size and NDV) |

How Spark Uses Bloom Filters at Read Time

Understanding the read path helps you reason about when bloom filters will help:

The Three-Layer Filtering Pipeline

When Spark reads an Iceberg table with a WHERE clause, filtering happens in three layers:

Layer 1: Partition pruning. Iceberg's metadata eliminates entire partitions that cannot match the predicate. This is the coarsest filter — it operates at the file/partition level.

Layer 2: Min/max statistics (column statistics). For each data file, Iceberg stores the min and max value of each column. Files where the predicate value falls outside the min/max range are skipped.

Layer 3: Bloom filters (row group level). For files that passed layers 1 and 2, Spark opens the Parquet file and checks the bloom filter in each row group's footer. Row groups where the bloom filter says "not present" are skipped.

1000 files in table
→ Partition pruning: 30 files remain
→ Min/max statistics: 15 files remain
→ Bloom filter: 1 file actually read

For high-cardinality columns, min/max statistics eliminate very few files (the range is too wide). Bloom filters fill this gap by providing precise set-membership testing.

What Spark Reads

  1. Spark reads the Parquet file footer (a few KB at the end of the file).
  2. The footer contains the bloom filter offset and length for each column in each row group.
  3. Spark reads the bloom filter data (up to 1 MB per column) from the specified offset.
  4. Spark hashes the filter value and checks the bloom filter.
  5. If the bloom filter says "not present", the entire row group is skipped — Spark does not read any column data from that row group.

This means the I/O cost of checking a bloom filter is: footer read + bloom filter read ≈ 1-2 MB per file, versus potentially hundreds of MB of column data if the row group is not skipped.

Performance Impact

Benchmarks

Testing from CERN's database team with Spark 3.5 and Parquet 1.13.1 on a 1-million-row dataset showed:

| Metric | Without bloom filter | With bloom filter | Improvement |
|---|---|---|---|
| Bytes read (absent value) | 8,299,656 | 1,091,611 | 87% reduction (7.6x) |
| File size (bytes) | 8,010,077 | 10,107,275 | +26% overhead |

Key observations:

  • For values NOT in the file, bloom filters reduced I/O by 87% — Spark skipped the row group entirely after checking the filter.
  • File size increased by 26% due to the bloom filter data stored in the footer. For production tables with the 1 MB default max-bytes, the overhead is typically 1-5% of total file size for 128-256 MB files.
  • For values that ARE in the file, there is no improvement — Spark reads the row group regardless. The bloom filter check adds negligible overhead (microseconds to check a bit array).

Real-World Performance Expectations

The actual speedup depends on your data distribution:

| Scenario | Expected improvement |
|---|---|
| Point lookup, value in 1 of 1000 files | ~1000x fewer files read |
| Point lookup, value in 1 of 100 files | ~100x fewer files read |
| IN clause with 10 values across 1000 files | ~10-100x fewer files read |
| Point lookup, value in most files | No improvement (the row groups genuinely contain the value) |
| Range query (BETWEEN, >, <) | No improvement (bloom filter not applicable) |

The biggest wins come when you are searching for a needle in a haystack — one specific value across thousands of files.

How to Validate Bloom Filters in Parquet Files

After enabling bloom filters, you need to verify they are actually present in the written Parquet files. Here are four methods.

Method 1: parquet-cli

The Apache Parquet CLI tool can inspect bloom filter metadata directly.

Install parquet-cli:

# Download the parquet-cli JAR (adjust version as needed)
wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-cli/1.14.4/parquet-cli-1.14.4-runtime.jar

Check footer for bloom filter offsets:

java -jar parquet-cli-1.14.4-runtime.jar footer /path/to/datafile.parquet

The output will show bloom filter offset and length for each column in each row group. If a column has a bloom filter, you will see non-null offset and length values.

Test a specific value against the bloom filter:

# Check if a value exists in the bloom filter for column 'user_id'
java -jar parquet-cli-1.14.4-runtime.jar bloom-filter \
-c user_id \
-v "a1b2c3d4-e5f6-7890" \
/path/to/datafile.parquet

The output will be one of:

  • column user_id has no bloom filter — bloom filter is NOT configured for this column.
  • value a1b2c3d4-e5f6-7890 NOT exists — value is definitively absent from this row group.
  • value a1b2c3d4-e5f6-7890 maybe exists — value might be present (positive or false positive).

Method 2: pyarrow — Read Parquet Metadata

You can inspect Parquet file metadata programmatically with the pyarrow library:

import pyarrow.parquet as pq

# Read the Parquet file metadata
parquet_file = pq.ParquetFile("/path/to/datafile.parquet")
metadata = parquet_file.metadata

print(f"Number of row groups: {metadata.num_row_groups}")

for rg_idx in range(metadata.num_row_groups):
    rg = metadata.row_group(rg_idx)
    print(f"\nRow group {rg_idx}: {rg.num_rows} rows")
    for col_idx in range(rg.num_columns):
        col = rg.column(col_idx)
        print(f"  Column: {col.path_in_schema}")
        print(f"  Bloom filter offset: {col.bloom_filter_offset}")
        print(f"  Bloom filter length: {col.bloom_filter_length}")

If bloom_filter_offset is not None (or not -1), a bloom filter exists for that column in that row group.

Method 3: Spark SQL — Check File Size Difference

A quick heuristic: write the same data with and without bloom filters and compare file sizes.

# Write without bloom filter
spark.sql("""
    CREATE TABLE test.no_bloom (user_id STRING, data STRING)
    USING iceberg
""")
spark.sql("INSERT INTO test.no_bloom SELECT * FROM source_data")

# Write with bloom filter
spark.sql("""
    CREATE TABLE test.with_bloom (user_id STRING, data STRING)
    USING iceberg
    TBLPROPERTIES ('write.parquet.bloom-filter-enabled.column.user_id' = 'true')
""")
spark.sql("INSERT INTO test.with_bloom SELECT * FROM source_data")

Then compare total file sizes on S3:

aws s3 ls --recursive s3://bucket/warehouse/test/no_bloom/data/ --summarize
aws s3 ls --recursive s3://bucket/warehouse/test/with_bloom/data/ --summarize

Files with bloom filters will be larger. The overhead depends on bloom-filter-max-bytes and the number of bloom-filter-enabled columns. Typically 1-5% larger for 128-256 MB files with 1 MB bloom filters.
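If you are testing against a local warehouse directory rather than S3 (a hypothetical local setup; the paths below are placeholders), the same comparison takes a few lines of Python:

```python
# Sum on-disk bytes under a directory tree -- a local stand-in for
# `aws s3 ls --summarize` when the warehouse lives on the local filesystem.
import os

def total_bytes(path: str) -> int:
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    )

# Hypothetical local warehouse paths:
# overhead = total_bytes("warehouse/test/with_bloom/data") \
#          / total_bytes("warehouse/test/no_bloom/data") - 1
```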

Method 4: Spark EXPLAIN — Check Row Group Filtering

After enabling bloom filters, verify they are being used at query time:

# Ensure Parquet filter pushdown is enabled (it is on by default)
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

# Run a point lookup and check the metrics
df = spark.sql("""
    SELECT * FROM analytics.events
    WHERE user_id = 'a1b2c3d4-e5f6-7890'
""")
df.collect()

In Spark UI, check the SQL tab for your query. Look at the "number of output rows" from the Parquet scan — with bloom filters, this should be significantly lower than without, because fewer row groups are read.

Bloom Filters and Existing Data

Important: Enabling bloom filters on an existing table only affects newly written files. Existing Parquet files will NOT retroactively get bloom filters.

To add bloom filters to existing data, you must rewrite the files:

-- Enable bloom filters
ALTER TABLE analytics.events SET TBLPROPERTIES (
  'write.parquet.bloom-filter-enabled.column.user_id' = 'true'
);

-- Rewrite all data files to include bloom filters
CALL spark_catalog.system.rewrite_data_files(
  table => 'analytics.events'
);

The rewrite_data_files procedure reads all existing Parquet files, applies the current table properties (including bloom filter settings), and writes new files with bloom filters included. This is a one-time operation — all subsequent writes will automatically include bloom filters.

Bloom Filters with Other Optimizations

Bloom filters work best when combined with other Iceberg optimizations:

Bloom Filters + Sort Order

If you sort data by a column, min/max statistics become highly effective for that column — each file has a narrow range. Bloom filters are then unnecessary for the sorted column. Use bloom filters for the non-sorted high-cardinality columns:

CREATE TABLE analytics.events (
  event_id BIGINT,
  user_id STRING,
  event_time TIMESTAMP,
  event_type STRING
)
USING iceberg
PARTITIONED BY (day(event_time))
TBLPROPERTIES (
  -- Bloom filter on user_id: min/max statistics are NOT effective
  'write.parquet.bloom-filter-enabled.column.user_id' = 'true'
);

-- Sort by event_type so min/max statistics become effective for it
-- (requires the Iceberg Spark SQL extensions)
ALTER TABLE analytics.events WRITE ORDERED BY event_type;

Bloom Filters + Bucket Partitioning

For tables bucketed by a column (e.g., bucket(64, user_id)), the bucket partition already limits files per query. Bloom filters provide additional row-group-level filtering within each bucket:

CREATE TABLE analytics.events (
  event_id BIGINT,
  user_id STRING,
  event_time TIMESTAMP
)
USING iceberg
PARTITIONED BY (bucket(64, user_id), day(event_time))
TBLPROPERTIES (
  'write.parquet.bloom-filter-enabled.column.user_id' = 'true'
);

Here, bucketing reduces the search to ~1/64th of files, and the bloom filter further reduces the row groups read within those files.

Bloom Filters + Column Metrics

Ensure column statistics are enabled for the bloom-filter-enabled columns so that min/max filtering works alongside bloom filters:

ALTER TABLE analytics.events SET TBLPROPERTIES (
  'write.metadata.metrics.column.user_id' = 'full',
  'write.parquet.bloom-filter-enabled.column.user_id' = 'true'
);

Production Checklist

When to enable:

  • Column has high cardinality (>10,000 distinct values per file)
  • Queries use equality predicates (=, IN) on this column
  • Column is NOT the primary sort key (sort order already handles it)
  • Column is NOT the partition key (partition pruning already handles it)

Configuration:

  • Enable per column: write.parquet.bloom-filter-enabled.column.<col>=true
  • Keep bloom-filter-max-bytes at 1 MB default unless profiling shows a need to change
  • Limit to 2-3 columns per table — each bloom filter can add up to bloom-filter-max-bytes (1 MB by default) per column per row group

Validation:

  • After first write, use parquet-cli bloom-filter -c <col> -v <value> <file> to confirm filter exists
  • Check file sizes before and after — expect 1-5% increase for 128-256 MB files
  • Run a point lookup and check Spark UI for reduced row group reads

Maintenance:

  • After enabling, run rewrite_data_files to backfill bloom filters on existing data
  • Monitor file size growth — if bloom filters are significantly inflating files, reduce bloom-filter-max-bytes
  • Regularly compact tables — many small files reduce bloom filter effectiveness because each file has fewer distinct values per row group

How Cazpian Handles This

On Cazpian, bloom filter configuration is available as a table-level setting in the managed catalog. When you enable bloom filters on a column, Cazpian automatically schedules a rewrite_data_files operation to backfill existing data. Cazpian's query optimizer also surfaces bloom filter skip metrics in the query history dashboard, so you can see exactly how many row groups were skipped per query.

What's Next

This post covered bloom filters — the third layer of Iceberg's query filtering pipeline. For the full picture, see our other posts in this series: