
2 posts tagged with "spark-sql"


The Complete Apache Spark and Iceberg Performance Tuning Checklist

· 35 min read
Cazpian Engineering
Platform Engineering Team

You have a Spark job running on Iceberg tables. It works, but it is slow, expensive, or both. You have read a dozen blog posts about individual optimizations — broadcast joins, AQE, partition pruning, compaction — but you do not have a single place that tells you what to check, in what order, and what the correct configuration values are. Every tuning session turns into a scavenger hunt across documentation, Stack Overflow, and tribal knowledge.

This post is the checklist you run through every time. We cover every performance lever in the Spark and Iceberg stack, organized from highest impact to lowest, with the exact configurations, recommended values, and links to our deep-dive posts for the full explanation. If you only have 30 minutes, work through the first five sections. If you have a day, work through all sixteen. Every item has been validated in production workloads on the Cazpian lakehouse platform.
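To make the shape of the checklist concrete, here is a minimal sketch of the kind of starting-point settings it covers. The config keys are real Spark SQL settings (AQE, post-shuffle coalescing, broadcast threshold, shuffle parallelism); the values are illustrative defaults, not universal recommendations, and the `apply_baseline` helper is hypothetical — validate everything against your own workload.

```python
# A few of the highest-impact Spark SQL settings a tuning pass touches.
# Keys are real Spark configs; values are illustrative starting points,
# not universal recommendations -- validate per workload.
tuning_baseline = {
    # Adaptive Query Execution: lets Spark rewrite plans at runtime
    "spark.sql.adaptive.enabled": "true",
    # Coalesce small post-shuffle partitions automatically under AQE
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Broadcast dimension tables up to this size instead of shuffling them
    "spark.sql.autoBroadcastJoinThreshold": "64m",
    # Shuffle partition count before AQE coalescing kicks in
    "spark.sql.shuffle.partitions": "400",
}

def apply_baseline(spark, overrides=None):
    """Apply the baseline to a live SparkSession (hypothetical helper).

    Per-job overrides take precedence over the shared baseline.
    """
    settings = {**tuning_baseline, **(overrides or {})}
    for key, value in settings.items():
        spark.conf.set(key, value)
    return settings
```

The point of keeping the baseline in a dict rather than scattered `spark.conf.set` calls is that the same table of settings can be reviewed, diffed, and overridden per job.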

Spark Execution Plan Deep Dive: Reading EXPLAIN Like a Pro

· 36 min read
Cazpian Engineering
Platform Engineering Team

You open the Spark UI after a three-hour job finishes. The SQL tab shows a wall of operators — SortMergeJoin, Exchange hashpartitioning, *(3) HashAggregate, BroadcastExchange HashedRelationBroadcastMode. Your teammate asks why the optimizer did not broadcast the small dimension table. You stare at the plan. You recognize some words. You do not know how to read it.

Every Spark performance guide says "check the execution plan." Every tuning blog says "verify predicate pushdown in the EXPLAIN output." Every debugging guide says "look for unnecessary shuffles." None of them teach you how to actually read the plan from top to bottom — what every operator means, what the *(n) asterisk notation means, what the three different filter fields in a FileScan node represent, or how the plan you see before execution differs from the plan that actually runs.

This post is the missing manual. We start with how Spark transforms your code into an execution plan through the Catalyst optimizer pipeline, cover every EXPLAIN mode and when to use each one, walk through every physical plan operator you will encounter, explain whole-stage code generation and what the *(n) notation means, show how to verify predicate pushdown, cover how Adaptive Query Execution rewrites plans at runtime, map EXPLAIN output to the Spark UI, catalog the anti-patterns you can spot in any plan, and finish with a full annotated walkthrough of a real query.
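As a taste of turning plan-reading into something checkable, here is a small hypothetical helper (not from the post) that extracts the PushedFilters list from the text of a physical plan — the string that `df.explain()` prints — so you can verify predicate pushdown programmatically instead of eyeballing a wall of operators. The sample plan line below is abbreviated but follows the shape Spark actually prints.

```python
import re

# Hypothetical helper: given physical-plan text (what df.explain() prints),
# pull the PushedFilters list out of a FileScan node. An empty result for
# a query that filters on a scanned column means pushdown did not happen.
def pushed_filters(plan_text: str) -> list[str]:
    match = re.search(r"PushedFilters:\s*\[([^\]]*)\]", plan_text)
    if match is None or not match.group(1).strip():
        return []  # no FileScan with pushed filters found
    # Individual filters are separated by ", "; commas inside a filter,
    # as in EqualTo(region,EMEA), have no trailing space.
    return [f.strip() for f in match.group(1).split(", ")]

# An abbreviated FileScan line in the shape Spark prints:
sample_plan = (
    "*(1) Filter (isnotnull(region#12) AND (region#12 = EMEA))\n"
    "+- FileScan parquet sales[region#12,amount#13] "
    "PushedFilters: [IsNotNull(region), EqualTo(region,EMEA)], "
    "ReadSchema: struct<region:string,amount:double>"
)
```

A check like this is handy in CI: assert that your hot queries still push their filters down after a Spark or table-format upgrade, rather than discovering a full scan in production.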