Skip to main content

2 posts tagged with "GC Tuning"

View All Tags

The Complete Apache Spark and Iceberg Performance Tuning Checklist

· 35 min read
Cazpian Engineering
Platform Engineering Team

The Complete Apache Spark and Iceberg Performance Tuning Checklist

You have a Spark job running on Iceberg tables. It works, but it is slow, expensive, or both. You have read a dozen blog posts about individual optimizations — broadcast joins, AQE, partition pruning, compaction — but you do not have a single place that tells you what to check, in what order, and what the correct configuration values are. Every tuning session turns into a scavenger hunt across documentation, Stack Overflow, and tribal knowledge.

This post is the checklist you run through every time. We cover every performance lever in the Spark and Iceberg stack, organized from highest impact to lowest, with the exact configurations, recommended values, and links to our deep-dive posts for the full explanation. If you only have 30 minutes, work through the first five sections. If you have a day, work through all sixteen. Every item has been validated in production workloads on the Cazpian lakehouse platform.

Spark OOM Debugging: The Complete Guide to Fixing Out of Memory Errors

· 22 min read
Cazpian Engineering
Platform Engineering Team

Spark OOM Debugging: The Complete Guide to Fixing Out of Memory Errors

The job fails. The log says OutOfMemoryError. Someone doubles spark.executor.memory from 8 GB to 16 GB. The job passes. Nobody asks why. The team moves on, carrying twice the compute cost forever.

Three months later, data volume grows. The same job fails again. They double it to 32 GB. Then 64 GB. Then they hit the maximum instance type and cannot scale further. Only then does someone ask: "What is actually consuming all this memory?"

This is the most expensive pattern in data engineering. Teams treat Spark memory as a single dial and OOM errors as a signal to turn it up. They do not distinguish between driver OOM and executor OOM. They do not check whether the problem is a single skewed partition, an oversized broadcast, a missing unpersist(), or a collect() call buried in a utility function. They do not know how to read GC logs, check spill metrics, or use the Spark UI to pinpoint the memory bottleneck.

This post gives you the complete debugging toolkit. We cover every type of OOM error, how to tell driver from executor, a step-by-step debugging workflow, GC tuning for G1GC and ZGC, memory observability with Prometheus and Grafana, the most common anti-patterns, and how Cazpian eliminates the guesswork.