
5 posts tagged with "Adaptive Query Execution"


The Complete Apache Spark and Iceberg Performance Tuning Checklist

35 min read
Cazpian Engineering
Platform Engineering Team


You have a Spark job running on Iceberg tables. It works, but it is slow, expensive, or both. You have read a dozen blog posts about individual optimizations — broadcast joins, AQE, partition pruning, compaction — but you do not have a single place that tells you what to check, in what order, and what the correct configuration values are. Every tuning session turns into a scavenger hunt across documentation, Stack Overflow, and tribal knowledge.

This post is the checklist you run through every time. We cover every performance lever in the Spark and Iceberg stack, organized from highest impact to lowest, with the exact configurations, recommended values, and links to our deep-dive posts for the full explanation. If you only have 30 minutes, work through the first five sections. If you have a day, work through all sixteen. Every item has been validated in production workloads on the Cazpian lakehouse platform.
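As a taste of the kind of settings the checklist covers, here is a hedged starting point. The values shown are the stock defaults or common starting points, not universal recommendations; the deep-dive posts explain when and why to deviate:

```properties
# Illustrative starting points, not universal defaults; tune per workload.
spark.sql.adaptive.enabled=true                    # AQE: on by default since Spark 3.2
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
spark.sql.autoBroadcastJoinThreshold=10485760      # 10 MB default; raise only with care
spark.sql.shuffle.partitions=200                   # default; AQE coalesces at runtime
```

Iceberg adds its own levers on top of these, such as table properties like `write.target-file-size-bytes`, which the checklist covers alongside the Spark configs.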

Spark Execution Plan Deep Dive: Reading EXPLAIN Like a Pro

36 min read
Cazpian Engineering
Platform Engineering Team


You open the Spark UI after a three-hour job finishes. The SQL tab shows a wall of operators — SortMergeJoin, Exchange hashpartitioning, *(3) HashAggregate, BroadcastExchange HashedRelationBroadcastMode. Your teammate asks why the optimizer did not broadcast the small dimension table. You stare at the plan. You recognize some words. You do not know how to read it.

Every Spark performance guide says "check the execution plan." Every tuning blog says "verify predicate pushdown in the EXPLAIN output." Every debugging guide says "look for unnecessary shuffles." None of them teach you how to actually read the plan from top to bottom — what every operator means, what the asterisk notation is, what the three different filter fields in a FileScan node represent, or how the plan you see before execution differs from the plan that actually runs.

This post is the missing manual. We start with how Spark transforms your code into an execution plan through the Catalyst optimizer pipeline, cover every EXPLAIN mode and when to use each one, walk through every physical plan operator you will encounter, explain whole-stage code generation and what the *(n) notation means, show how to verify predicate pushdown, cover how Adaptive Query Execution rewrites plans at runtime, map EXPLAIN output to the Spark UI, catalog the anti-patterns you can spot in any plan, and finish with a full annotated walkthrough of a real query.
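As a quick orientation, these are the EXPLAIN modes Spark 3.x supports; the query itself is a made-up example:

```sql
EXPLAIN SELECT * FROM sales WHERE region = 'EU';           -- physical plan only (default)
EXPLAIN EXTENDED SELECT * FROM sales WHERE region = 'EU';  -- parsed, analyzed, optimized, and physical plans
EXPLAIN FORMATTED SELECT * FROM sales WHERE region = 'EU'; -- operator outline plus per-node details
EXPLAIN COST SELECT * FROM sales WHERE region = 'EU';      -- optimized plan with statistics, where available
EXPLAIN CODEGEN SELECT * FROM sales WHERE region = 'EU';   -- generated whole-stage Java code
```

From the DataFrame API, `df.explain(mode="formatted")` produces the same output as the FORMATTED mode.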

Spark SQL Join Strategy: The Complete Optimization Guide

36 min read
Cazpian Engineering
Platform Engineering Team


Your Spark job runs for 45 minutes. You check the Spark UI and find that a single join stage consumed 38 of those minutes — shuffling 800 GB across the network because the optimizer picked SortMergeJoin for a query where one side was 40 MB after filtering. Nobody ran ANALYZE TABLE. No statistics existed. The optimizer had no idea the table was small enough to broadcast.

The join strategy is the single most impactful decision in a Spark SQL query plan. It determines whether your data shuffles across the network, whether it spills to disk, whether your driver runs out of memory, and whether your query finishes in seconds or hours. Spark offers five distinct join strategies, each with different performance characteristics, memory requirements, and failure modes. The optimizer picks one based on statistics, hints, configuration, and join type — and it often picks wrong when it lacks information.

This post covers every join strategy in Spark, how the JoinSelection decision tree works internally, how the Catalyst optimizer estimates sizes, how CBO reorders multi-table joins, every join hint and when to use it, how AQE converts strategies at runtime, the equi vs non-equi join problem, how to read physical plans, the most common anti-patterns, and a real-world decision framework you can use in production.
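To make the size-based core of that decision concrete, here is a deliberately simplified pure-Python model of the selection logic for the common cases. The real JoinSelection also weighs hints, join type, and settings like `spark.sql.join.preferSortMergeJoin`, and without statistics Spark's size estimates default to very large, which is exactly why the 40 MB table in the story above was not broadcast:

```python
# Simplified, illustrative model of Spark's join-strategy selection.
# Function name and structure are ours, not Spark internals.

BROADCAST_THRESHOLD = 10 * 1024 * 1024  # spark.sql.autoBroadcastJoinThreshold default (10 MB)

def pick_join_strategy(left_bytes, right_bytes, equi_join=True):
    if not equi_join:
        # Non-equi joins cannot use hash- or sort-based equi strategies
        if min(left_bytes, right_bytes) <= BROADCAST_THRESHOLD:
            return "BroadcastNestedLoopJoin"
        return "CartesianProduct"
    if min(left_bytes, right_bytes) <= BROADCAST_THRESHOLD:
        return "BroadcastHashJoin"  # ship the small side to every executor
    return "SortMergeJoin"          # default for large equi-joins

# A 2 TB fact table against a 40 MB dimension: over the 10 MB threshold,
# so the estimate-driven choice falls through to SortMergeJoin.
print(pick_join_strategy(2 * 1024**4, 40 * 1024 * 1024))  # SortMergeJoin
print(pick_join_strategy(2 * 1024**4, 5 * 1024 * 1024))   # BroadcastHashJoin
```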

Spark Data Skew: The Complete Guide to Identification, Debugging, and Optimization

35 min read
Cazpian Engineering
Platform Engineering Team


Your 200-node cluster finished 199 of 200 tasks in 30 seconds. The last task has been running for 45 minutes. Every executor except one is idle, burning compute cost while it waits for a single partition containing 80% of the data to finish processing. The stage progress bar is stuck at 99.5%. Your Spark job that should take 2 minutes is taking an hour.

This is data skew: the single most common and most destructive performance problem in distributed data processing. It turns a perfectly parallelized cluster into an expensive single-threaded computation. It wastes money, wastes time, and breaks SLAs. And it is entirely fixable once you know how to identify it and which optimization to apply.

This post goes deep on every dimension of data skew. We start with what it is and why it kills performance, show exactly how to identify it in the Spark UI, catalog every type of skew you will encounter, cover the AQE automatic optimizations that handle skew at runtime, walk through every manual fix with code examples, address Iceberg-specific skew problems, provide a complete configuration reference, and close with the anti-patterns that cause skew in the first place.
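One of the manual fixes the post walks through is key salting. The idea fits in a few lines; this is a pure-Python stand-in for the logic (in Spark you would build the salted key with `concat` and `rand` columns instead):

```python
import random

# Illustrative key-salting sketch: spread a hot key across N sub-keys so its
# rows land in N partitions instead of one. Names and bucket count are ours.

SALT_BUCKETS = 8

def salt_key(key):
    # Skewed side: append a random salt suffix to every row's key
    return f"{key}#{random.randrange(SALT_BUCKETS)}"

def explode_key(key):
    # Other side: replicate each key once per bucket so the join still matches
    return [f"{key}#{i}" for i in range(SALT_BUCKETS)]

# 10,000 rows for one hot key now spread across up to 8 sub-keys
salted = {salt_key("hot_customer") for _ in range(10_000)}
assert salted <= set(explode_key("hot_customer"))  # every salted key finds a match
```

The cost is that the small side is replicated `SALT_BUCKETS` times, so salting trades a bounded amount of extra data for even task sizes.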

Spark Broadcast Joins: The Complete Guide for Iceberg and Cazpian

27 min read
Cazpian Engineering
Platform Engineering Team


Your 500-node Spark cluster is shuffling a 2 TB fact table across the network — serializing every row, hashing it, writing it to disk, sending it over the wire, and deserializing it on the other side — just to join it with a 50 MB dimension table. Every single query. Every single day.

The shuffle is the most expensive operation in distributed computing. It consumes network bandwidth, disk I/O, CPU cycles, and memory on every executor in the cluster. And for joins where one side is small, it is completely unnecessary.

Broadcast join eliminates the shuffle entirely. Instead of redistributing both tables across the cluster, Spark sends the small table to every executor, where it is stored as an in-memory hash map. Each executor then joins its local partitions of the large table against this hash map — no network shuffle, no disk spill, no cross-executor coordination. The result is typically a 5-20x speedup over Sort-Merge Join for eligible queries.
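What each executor does locally is just a hash build followed by a probe. This pure-Python sketch, with made-up table contents, shows the two phases; in PySpark you would trigger the same plan with the `broadcast()` hint from `pyspark.sql.functions`:

```python
# Minimal sketch of the per-executor work in a broadcast hash join:
# build a hash map from the broadcast (small) side once, then stream
# the local partition of the large side through it.

dim_table = [(1, "US"), (2, "EU"), (3, "APAC")]            # small side, broadcast
fact_partition = [(101, 1), (102, 2), (103, 1), (104, 9)]  # (order_id, region_id)

# Build phase: hash the broadcast side on the join key
hashed = {region_id: name for region_id, name in dim_table}

# Probe phase: inner join with no shuffle and no sort
joined = [
    (order_id, hashed[region_id])
    for order_id, region_id in fact_partition
    if region_id in hashed
]
print(joined)  # [(101, 'US'), (102, 'EU'), (103, 'US')]
```

Note that row 104 drops out because its `region_id` has no match in the hash map, which is exactly the inner-join semantics of the probe.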

This post goes deep. We cover exactly what happens inside a broadcast join, all five ways to trigger one, how AQE converts joins at runtime, the real memory math that catches people off guard, when broadcast hurts instead of helps, and how to monitor and debug broadcast behavior in production.