Storage Partitioned Joins in Apache Iceberg with Spark
Unless one side is small enough to broadcast, every Spark join starts the same way: read both sides, shuffle the data across the network so matching keys end up on the same executor, then join. That shuffle is often the single most expensive step in a Spark job: it moves data across the network, writes temporary shuffle files to disk, and consumes memory on every executor in the cluster.
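You can see the shuffle directly in the physical plan. Here is a minimal sketch, assuming a running SparkSession (as in spark-shell) and two hypothetical registered tables, `orders` and `customers`, joined on `customer_id`:

```scala
// Hypothetical tables; any two tables joined on a common key will do.
val joined = spark.table("orders")
  .join(spark.table("customers"), "customer_id")

// The physical plan exposes the shuffle: look for an
// "Exchange hashpartitioning(customer_id, ...)" node feeding each
// side of the SortMergeJoin.
joined.explain()
```

Those `Exchange` nodes are the cost SPJ is designed to eliminate.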
But what if both tables are already organized by the join key on disk? If the left table's customer_id=42 rows all sit in one bucket, and the right table, using the same bucketing function, puts its customer_id=42 rows in that same bucket, there is nothing to shuffle. Each executor can join its local partitions independently.
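In Iceberg terms, "organized by the join key" usually means both tables share a compatible bucket partition spec. A rough sketch of such a layout, with hypothetical catalog, table, and column names and a bucket count of 16 on both sides:

```scala
// Both tables bucket on the join key with the same bucket count.
// Iceberg applies the same hash function on both sides, so equal
// customer_id values always land in the same bucket number.
spark.sql("""
  CREATE TABLE local.db.orders (
    order_id BIGINT, customer_id BIGINT, amount DECIMAL(10,2))
  USING iceberg
  PARTITIONED BY (bucket(16, customer_id))
""")

spark.sql("""
  CREATE TABLE local.db.customers (
    customer_id BIGINT, name STRING)
  USING iceberg
  PARTITIONED BY (bucket(16, customer_id))
""")
```

The matching specs are what make the shuffle redundant: the storage layout already guarantees that matching keys live side by side.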
That is exactly what Storage Partitioned Join (SPJ) does. Introduced in Spark 3.3 and matured in Spark 3.4+, SPJ is one of the most impactful, and least understood, optimizations available for Iceberg+Spark workloads. This post shows you how it works, how to set it up, how to verify it, and where it breaks.
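As a preview of the setup covered below, the two switches at the heart of SPJ look roughly like this; exact defaults and the full list of related settings vary by Spark and Iceberg version:

```scala
// Let Spark use the partitioning reported by DataSource V2 sources
// to plan joins without a shuffle.
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")
// Have Iceberg preserve its data grouping at planning time so Spark
// can see that both tables share a compatible layout.
spark.conf.set("spark.sql.iceberg.planning.preserve-data-grouping", "true")
```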