Spark OOM Debugging: The Complete Guide to Fixing Out of Memory Errors
The job fails. The log says OutOfMemoryError. Someone doubles spark.executor.memory from 8 GB to 16 GB. The job passes. Nobody asks why. The team moves on, carrying twice the compute cost forever.
Three months later, data volume grows. The same job fails again. They double it to 32 GB. Then 64 GB. Then they hit the maximum instance type and cannot scale further. Only then does someone ask: "What is actually consuming all this memory?"
This is the most expensive pattern in data engineering. Teams treat Spark memory as a single dial and OOM errors as a signal to turn it up. They do not distinguish between driver OOM and executor OOM. They do not check whether the problem is a single skewed partition, an oversized broadcast, a missing unpersist(), or a collect() call buried in a utility function. They do not know how to read GC logs, check spill metrics, or use the Spark UI to pinpoint the memory bottleneck.
This post gives you the complete debugging toolkit. We cover every type of OOM error, how to tell driver from executor, a step-by-step debugging workflow, GC tuning for G1GC and ZGC, memory observability with Prometheus and Grafana, the most common anti-patterns, and how Cazpian eliminates the guesswork.
Every Type of OOM Error
Not all OOM errors are the same. Each one has a different root cause, appears in different logs, and requires a different fix. The first step in debugging is identifying which type you have.
1. Java Heap Space
java.lang.OutOfMemoryError: Java heap space
What it means: The JVM heap is exhausted. The garbage collector cannot free enough memory to satisfy a new allocation request.
Where it appears: Driver logs or executor logs (check which one).
Common causes:
- Executor: Large shuffle partitions that do not fit in execution memory, oversized hash tables during joins or aggregations, too many concurrent tasks sharing limited heap
- Driver: collect() on a large DataFrame, building a large broadcast variable, accumulating too many results
Key diagnostic: Check whether this appears in the driver log or executor log. If executor, check which stage and task failed in the Spark UI Stages tab.
2. GC Overhead Limit Exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
What it means: The JVM is spending more than 98% of its time in garbage collection and recovering less than 2% of the heap with each collection cycle. The JVM gives up.
Where it appears: Driver or executor logs.
Common causes:
- Heap is nearly full and the GC is churning — repeatedly scanning the same long-lived objects
- Too many small objects (e.g., Java Strings from deserialization, UDF intermediate results)
- Cached data occupying Old Gen, leaving too little space for execution allocations
Key diagnostic: This is often worse than a simple heap space error — it means the application was effectively frozen doing GC for an extended period before the error was thrown. Enable GC logging to see the pattern.
3. Direct Buffer Memory
java.lang.OutOfMemoryError: Direct buffer memory
What it means: NIO direct byte buffers (allocated outside the JVM heap, in native memory) are exhausted.
Where it appears: Executor logs, usually during shuffle read/write.
Common causes:
- Shuffle operations use Netty for network transfers, which allocates direct byte buffers
- spark.reducer.maxSizeInFlight (default 48 MB) controls how many shuffle blocks are fetched concurrently — too high with many concurrent tasks can exhaust direct memory
- Insufficient -XX:MaxDirectMemorySize JVM option
Fix:
spark.executor.extraJavaOptions = -XX:MaxDirectMemorySize=2g
Or increase spark.executor.memoryOverhead to give the process more non-heap room.
4. Metaspace
java.lang.OutOfMemoryError: Metaspace
What it means: The JVM Metaspace (which holds class metadata, method bytecodes, and constant pools) is full.
Where it appears: Driver or executor logs.
Common causes:
- Applications that dynamically generate classes (some serialization frameworks, code generation)
- Large numbers of UDFs each creating new class definitions
- Spark's whole-stage code generation creating many compiled classes for complex queries
Fix:
spark.executor.extraJavaOptions = -XX:MaxMetaspaceSize=512m
spark.driver.extraJavaOptions = -XX:MaxMetaspaceSize=512m
Default Metaspace is unlimited in size but may be restricted by container memory limits.
5. Container Killed by YARN
Container killed by YARN for exceeding memory limits.
X GB of Y GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.
What it means: The total physical memory (RSS) of the executor process exceeded the YARN container limit. YARN killed the container. Exit code: 137.
Where it appears: YARN NodeManager logs and Spark driver logs.
Common causes:
- Off-heap memory (NIO direct buffers, native memory) exceeded spark.yarn.executor.memoryOverhead
- PySpark Python worker memory not accounted for
- Off-heap Spark memory enabled but not added to container calculation
- JVM internal overhead (thread stacks, Metaspace, JIT compiled code) exceeded the overhead budget
Key characteristic: There is no Java stack trace. The process was killed externally. This makes it the hardest OOM error to diagnose.
Fix: Increase spark.yarn.executor.memoryOverhead or spark.executor.memoryOverhead.
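Before resizing anything, it helps to know what YARN is actually enforcing: the container limit is heap plus overhead. A minimal sketch of that arithmetic, assuming the documented default overhead formula max(10% of heap, 384 MB); yarn_container_mb is a hypothetical helper name:

```python
def yarn_container_mb(executor_memory_mb, overhead_mb=None):
    """Total physical memory YARN reserves for one executor container.

    Overhead defaults to max(10% of heap, 384 MB), matching the
    documented default of spark.executor.memoryOverhead.
    """
    if overhead_mb is None:
        overhead_mb = max(int(executor_memory_mb * 0.10), 384)
    return executor_memory_mb + overhead_mb

# An 8 GB heap actually requests ~8.8 GB from YARN:
print(yarn_container_mb(8192))  # 9011
# Small heaps hit the 384 MB floor:
print(yarn_container_mb(1024))  # 1408
```

If the process's RSS crosses this total — not just the heap — YARN kills it, which is why raising spark.executor.memory alone often does not fix exit code 137.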
6. Kubernetes OOMKilled
Reason: OOMKilled
Exit Code: 137
What it means: The pod's memory usage exceeded the Kubernetes memory limit. The cgroup OOM killer terminated the process.
Where it appears: kubectl describe pod <pod-name> output.
Common causes: Same as YARN container kills — off-heap memory exceeding the overhead budget.
Fix: Increase spark.kubernetes.memoryOverheadFactor or spark.executor.memoryOverhead.
7. SparkOutOfMemoryError
org.apache.spark.memory.SparkOutOfMemoryError:
Unable to acquire X bytes of memory, got 0
What it means: Spark's internal memory manager (not the JVM) could not allocate memory for a task. The unified execution memory pool is exhausted.
Where it appears: Executor logs, in a specific task.
Common causes:
- A single task is trying to build a hash table (ShuffledHashJoin, BroadcastHashJoin) that is larger than available execution memory
- Skewed partition — one partition has far more data than others
- Too many concurrent tasks competing for the shared execution memory pool
Key diagnostic: Check the Stages tab for the failed task. Look at input size, shuffle read size, and spill metrics. Compare the failed task's metrics with the median — if max >> median, you have data skew.
Fix: Increase partitions (spark.sql.shuffle.partitions), fix data skew, or increase executor memory.
Step 1: Driver or Executor?
The most critical first question. The answer determines your entire debugging path.
How to Tell
Check the Spark UI Executors tab:
- If the driver row shows failure or high memory → driver OOM
- If an executor row shows failure → executor OOM
Check the logs:
- Driver OOM: the error appears in the application master log or the client-mode driver log. The application itself terminates
- Executor OOM: the error appears in an executor's stderr log. The executor is lost, and tasks are retried on other executors (unless max retries are exhausted)
Check the behavior:
- Driver OOM: the entire application crashes immediately
- Executor OOM: individual tasks fail and are retried. The application may survive if enough retries succeed
Driver OOM Debugging Path
If the OOM is on the driver:
- Check for collect(). Search your code for .collect(), .toPandas(), .take(large_number), .show(large_number). These pull data to the driver
- Check broadcast sizes. Look at the SQL tab for BroadcastExchange nodes. Check the data size — if it is hundreds of MB, the driver needs to hold that in memory
- Check accumulators. Custom accumulators that store large values (lists, maps) accumulate on the driver
- Check maxResultSize. spark.driver.maxResultSize (default 1 GB) limits the total result size. If the job is collecting more than this, it fails with a different error — but if it is set too high, the actual collection can OOM the driver
Fix: Increase spark.driver.memory, or (better) eliminate the data collection. Replace collect() with .write(), replace toPandas() with pandas_udf processing on executors.
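When a collect() genuinely cannot be avoided, estimate the result size before running it. A back-of-envelope sketch — estimated_collect_mb is a hypothetical helper, and avg_row_bytes is an assumption you should measure from a small sample:

```python
def estimated_collect_mb(row_count, avg_row_bytes):
    """Rough size of a collect() result landing on the driver.

    avg_row_bytes must be measured (e.g. serialize a 1,000-row sample);
    compare the result against spark.driver.maxResultSize and
    spark.driver.memory before calling collect().
    """
    return row_count * avg_row_bytes / (1024 * 1024)

# 50M rows at ~200 bytes each is ~9.5 GB -- far past the 1 GB
# maxResultSize default and enough to OOM most drivers:
print(round(estimated_collect_mb(50_000_000, 200)))  # 9537
```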
Executor OOM Debugging Path
If the OOM is on an executor:
Continue to Step 2 below.
Step 2: Which Stage and Task Failed?
Open the Spark UI → Stages tab. Find the failed stage. Click into it.
What to Look For
Failed Tasks table: Shows which tasks failed and why. The error message tells you the OOM type.
Task Metrics to compare:
| Metric | What It Tells You |
|---|---|
| Input Size | How much data the task read from source |
| Shuffle Read | How much data arrived via shuffle |
| Shuffle Write | How much data was written for the next stage |
| Spill (Memory) | Data that was in memory before spilling |
| Spill (Disk) | Data written to disk because memory was full |
| GC Time | Time spent in garbage collection |
| Duration | Total task time |
The Skew Check
Compare the failed task's metrics with the median across all tasks in the stage:
If max(Shuffle Read) >> 10x median(Shuffle Read) → data skew
If max(Input Size) >> 10x median(Input Size) → partition skew
If max(Duration) >> 10x median(Duration) → straggler task
If you see skew, the fix is not more memory — it is fixing the skew. See our Spark Data Skew guide for techniques (AQE skew join, salting, isolate-and-broadcast).
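The max-vs-median test is easy to automate against per-task metrics exported from the UI or REST API. A sketch — find_skewed_tasks is a hypothetical helper implementing the 10x-median rule above:

```python
import statistics

def find_skewed_tasks(shuffle_read_bytes, factor=10):
    """Return indices of tasks whose shuffle read exceeds
    `factor` x the median -- the skew test described above."""
    med = statistics.median(shuffle_read_bytes)
    return [i for i, b in enumerate(shuffle_read_bytes)
            if b > factor * med]

# 199 even partitions plus one hot partition:
reads = [100_000_000] * 199 + [4_000_000_000]
print(find_skewed_tasks(reads))  # [199]
```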
Step 3: Check Spill Metrics
Spill metrics in the Stages tab reveal whether the executor was under memory pressure.
Spill (Memory): The size of deserialized data that was in memory before being spilled.
Spill (Disk): The size of serialized data written to disk during the spill.
What Spill Tells You
- No spill, but OOM: The operation cannot spill — likely a hash table build (ShuffledHashJoin, BroadcastHashJoin) that must fit entirely in memory. Fix: use SortMergeJoin instead (it spills gracefully)
- Large spill, no OOM: The operation is spilling successfully but performance is degraded. Fix: increase executor memory or reduce partition sizes
- Spill + OOM: The spill mechanism was overwhelmed — too much data for even disk-backed processing. Fix: increase partitions to reduce per-task data volume
Spill Ratio
Spill Ratio = Spill (Disk) / Spill (Memory)
- Ratio ~0.3-0.5: Normal (serialized data is smaller than deserialized due to compression)
- Ratio near 1.0: Poor compression — data may not benefit from columnar format
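The same ratio as a sketch, using the two spill metrics from the Stages tab (spill_ratio is a hypothetical helper):

```python
def spill_ratio(disk_bytes, memory_bytes):
    """Spill (Disk) / Spill (Memory) from the Stages tab."""
    if memory_bytes == 0:
        return None  # no spill occurred
    return disk_bytes / memory_bytes

# 30 GB deserialized in memory spilled as 12 GB serialized on disk:
r = spill_ratio(12 * 2**30, 30 * 2**30)
print(round(r, 2))  # 0.4 -- within the normal 0.3-0.5 band
```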
Step 4: Check GC Time
In the Stages tab, expand the Summary Metrics section. Check the GC Time column.
What GC Time Tells You
- GC Time < 5% of task time: Normal. Not a memory problem
- GC Time 5-10% of task time: Borderline. Monitor but may be acceptable
- GC Time > 10% of task time: Memory pressure. The GC is struggling
- GC Time > 30% of task time: Severe. Likely approaching GC overhead limit
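These thresholds turn directly into a triage function. A sketch — gc_pressure is a hypothetical helper encoding the buckets above:

```python
def gc_pressure(gc_time_ms, task_time_ms):
    """Classify a task's GC time using the thresholds above."""
    pct = 100.0 * gc_time_ms / task_time_ms
    if pct < 5:
        return "normal"
    if pct <= 10:
        return "borderline"
    if pct <= 30:
        return "memory pressure"
    return "severe"

print(gc_pressure(2_000, 60_000))   # normal (~3.3% of task time)
print(gc_pressure(25_000, 60_000))  # severe (~41.7% of task time)
```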
GC Time Patterns
High GC on all tasks: The executor heap is too small for the workload. Increase spark.executor.memory or reduce spark.executor.cores (fewer concurrent tasks = less memory pressure per task).
High GC on a few tasks: Data skew. Some partitions are much larger than others. The skewed tasks create more objects, causing more GC.
GC time increases over the stage lifetime: Memory leak or cache accumulation. Objects that should be short-lived are surviving into Old Gen.
Step 5: Enable GC Logging
For detailed GC analysis, enable GC logging:
Java 11+ (recommended):
spark.executor.extraJavaOptions = -Xlog:gc*:file=/tmp/gc-executor.log:time,uptime,level,tags
spark.driver.extraJavaOptions = -Xlog:gc*:file=/tmp/gc-driver.log:time,uptime,level,tags
Java 8:
spark.executor.extraJavaOptions = -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/gc-executor.log
What to Look For in GC Logs
Full GC events: These are stop-the-world pauses that freeze all threads. Frequent full GCs indicate severe memory pressure.
[Full GC (Allocation Failure) 14G->12G(16G), 8.234 secs]
This says: Full GC freed only 2 GB from a 16 GB heap, and it took 8 seconds. The heap is 75% full of long-lived objects (cached data, broadcast variables).
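Pulling these numbers out of a log programmatically is straightforward. A sketch that parses the Java 8-style Full GC line above — parse_full_gc is a hypothetical helper, and the regex assumes sizes reported in whole gigabytes; adjust it for your JVM's actual log format:

```python
import re

# Matches lines like: [Full GC (Allocation Failure) 14G->12G(16G), 8.234 secs]
FULL_GC = re.compile(r"Full GC .*?(\d+)G->(\d+)G\((\d+)G\), ([\d.]+) secs")

def parse_full_gc(line):
    """Extract before/after/total heap and pause time from a Full GC line."""
    m = FULL_GC.search(line)
    if not m:
        return None
    before, after, total, secs = m.groups()
    return {"freed_gb": int(before) - int(after),
            "live_gb": int(after),
            "heap_gb": int(total),
            "pause_s": float(secs)}

print(parse_full_gc("[Full GC (Allocation Failure) 14G->12G(16G), 8.234 secs]"))
# {'freed_gb': 2, 'live_gb': 12, 'heap_gb': 16, 'pause_s': 8.234}
```

A rising live_gb across successive Full GC lines is the signature of cache accumulation or a leak.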
Humongous allocations (G1GC): Objects larger than half a G1 region are allocated directly in Old Gen, bypassing Young Gen. This triggers more Full GCs.
[G1Ergonomics ... Humongous Allocation, size: 33554448]
Fix: Increase G1HeapRegionSize so fewer objects are classified as humongous.
To-space exhausted (G1GC): G1 cannot find enough free regions to copy surviving objects during evacuation. This triggers a Full GC.
[GC pause (G1 Evacuation Pause) (to-space exhausted)]
Fix: Increase heap size or reduce the live data set.
GC Tuning for Spark
G1GC (Recommended for Most Workloads)
G1GC is the default garbage collector for Spark on Java 11+. It divides the heap into equal-sized regions and collects regions with the most garbage first.
Recommended configuration:
spark.executor.extraJavaOptions = -XX:+UseG1GC \
-XX:InitiatingHeapOccupancyPercent=35 \
-XX:G1HeapRegionSize=32m \
-XX:MaxGCPauseMillis=200 \
-XX:ParallelGCThreads=8 \
-XX:ConcGCThreads=2
Key parameters:
| Parameter | Default | Recommended | Why |
|---|---|---|---|
| InitiatingHeapOccupancyPercent | 45 | 35 | Start concurrent marking earlier to avoid Full GC |
| G1HeapRegionSize | auto | 32m (heaps >16 GB) | Reduce humongous allocations for Spark's large buffers |
| MaxGCPauseMillis | 200 | 200 | Target pause time (G1 adjusts collection to meet this) |
| ParallelGCThreads | auto | 8 | Parallel GC threads (for STW phases) |
| ConcGCThreads | auto | 2 | Concurrent marking threads |
ZGC (For Ultra-Low Latency)
ZGC provides sub-millisecond pause times regardless of heap size. Available on Java 21+ with generational mode.
spark.executor.extraJavaOptions = -XX:+UseZGC -XX:+ZGenerational
When to use ZGC:
- Heap sizes > 64 GB where G1GC pauses become long
- Workloads that are latency-sensitive (interactive queries)
- When GC tuning of G1GC is not achieving acceptable pause times
Tradeoff: ZGC uses more CPU for concurrent collection and more memory for colored pointers (additional metadata per object reference). Do not use ZGC if CPU is the bottleneck.
How Cached Data Affects GC
Cached DataFrames and broadcast variables are long-lived objects that survive into the Old Gen. The GC must scan them every Full GC cycle but can never collect them (they are still referenced).
Impact:
- More cached data = larger live set = longer GC pauses
- G1GC's Old Gen fills up faster, triggering concurrent marking earlier
- If the live set (cached data + active execution) approaches heap size, the GC thrashes
Fix: Cache only what is reused. Call unpersist() when done. Monitor the Storage tab to see how much is cached. Use MEMORY_AND_DISK_SER instead of MEMORY_ONLY to reduce the in-heap footprint.
Kryo vs Java Serialization
Spark's default JavaSerializer creates verbose serialized representations with many small Java objects. Switching to Kryo reduces both serialized size and GC pressure.
spark.serializer = org.apache.spark.serializer.KryoSerializer
Impact:
- Shuffle data is 2-10x more compact → less network transfer, less memory
- Fewer Java objects created during deserialization → less GC work
- Broadcast variables are smaller → less driver memory, faster distribution
Memory Observability
Spark UI Executors Tab
The Executors tab shows real-time memory usage per executor:
| Column | What It Shows |
|---|---|
| Storage Memory | used / total — how much cached data each executor holds |
| On Heap | JVM heap memory usage (unified + user) |
| Off Heap | Off-heap memory usage (if enabled) |
| Disk Used | Spilled or MEMORY_AND_DISK cached data on local disk |
What to look for:
- All executors at 90%+ storage → over-caching
- One executor at 95%, rest at 30% → data skew in cached partitions
- Large "Disk Used" → memory pressure causing spill
Spark UI Storage Tab
Shows every cached RDD/DataFrame:
| Column | What It Shows |
|---|---|
| RDD Name | The cached dataset name |
| Storage Level | MEMORY_ONLY, MEMORY_AND_DISK, etc. |
| Cached Partitions | Number of partitions in cache |
| Size in Memory | Memory consumed |
| Size on Disk | Disk consumed (for spilled partitions) |
What to look for:
- Datasets with 0% cached → evicted by execution memory pressure
- Very large cached datasets that are not being reused → wasted memory
- Mismatched storage levels → some data MEMORY_ONLY when it should be MEMORY_AND_DISK
Spark UI Stages Tab — Summary Metrics
Expand the summary metrics for any stage to see:
- Shuffle Read Size — distribution across tasks (min, 25th, median, 75th, max)
- Shuffle Write Size — same distribution
- Spill (Memory / Disk) — which tasks are spilling
- GC Time — GC distribution across tasks
- Peak Execution Memory — maximum execution memory per task
The skew test: If max >> 10x median for any metric, you have skew.
Spark UI Environment Tab
Shows all configured memory parameters:
- spark.executor.memory, spark.driver.memory
- spark.memory.fraction, spark.memory.storageFraction
- spark.executor.memoryOverhead
- All JVM options (GC flags, off-heap settings)
Use this to verify your configuration is actually applied.
REST API
Programmatic access to executor metrics:
GET /api/v1/applications/{appId}/executors
Returns JSON with memory metrics per executor — useful for automated monitoring.
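For example, once the JSON is fetched and parsed, a few lines can flag executors running near their storage limit. The field names (id, memoryUsed, maxMemory) follow Spark's ExecutorSummary schema; flag_hot_executors is a hypothetical helper:

```python
def flag_hot_executors(executors, threshold=0.85):
    """Given the parsed JSON list from
    /api/v1/applications/{appId}/executors, return ids of executors
    whose storage memory exceeds `threshold` of the maximum."""
    return [e["id"] for e in executors
            if e["maxMemory"] and e["memoryUsed"] / e["maxMemory"] > threshold]

sample = [
    {"id": "driver", "memoryUsed": 1 * 2**30, "maxMemory": 4 * 2**30},
    {"id": "1", "memoryUsed": int(3.8 * 2**30), "maxMemory": 4 * 2**30},
]
print(flag_hot_executors(sample))  # ['1']
```

Polling this on a schedule gives you an early-warning signal long before an OOM appears in the logs.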
Prometheus + Grafana
For production monitoring, export Spark metrics to Prometheus:
spark.ui.prometheus.enabled = true
This exposes a /metrics/executors/prometheus endpoint that Prometheus can scrape.
Key metrics to dashboard:
| Metric | Alert Threshold |
|---|---|
| jvm.heap.used / jvm.heap.max | > 85% sustained |
| executor.GCTime / executor.runTime | > 10% |
| executor.diskBytesSpilled | > 0 (sustained) |
| executor.memoryBytesSpilled | > 0 (sustained) |
| executor.failedTasks | > 0 |
JMX metrics sink:
# metrics.properties
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
Enables JMX exposure of all Spark metrics for tools like VisualVM, JConsole, or JMX-to-Prometheus exporters.
Grafana Dashboard Setup
- Configure Spark to export to Prometheus (spark.ui.prometheus.enabled = true)
- Configure Prometheus to scrape the Spark master/executor endpoints
- Import a pre-built Spark dashboard (Grafana Dashboard ID 7890) or build custom panels
- Key panels: heap usage over time, GC pause duration, spill bytes, executor memory distribution
Memory Anti-Patterns
1. Doubling Memory Without Diagnosis
The most expensive anti-pattern. A 16 GB executor OOM does not mean you need 32 GB — it often means one partition has 10x the data of others, and salting would fix it at 16 GB.
Fix: Always run the debugging workflow (Steps 1-5) before changing memory.
2. collect() on Large DataFrames
# This pulls ALL data to driver — guaranteed OOM for large DataFrames
all_rows = df.collect()
# This too — creates a Pandas DataFrame in driver memory
pandas_df = df.toPandas()
Fix: Process data on executors. Write results to storage. Use .take(N) for samples. Use pandas_udf for pandas operations on executors.
3. Caching Too Much Data
# Caching a 500 GB table on a cluster with 100 GB total memory
large_table.cache()
large_table.count() # Triggers caching
# Result: eviction thrash, GC storms, performance worse than no cache
Fix: Cache only frequently reused, small-to-medium datasets. Monitor the Storage tab. Call unpersist() when done.
4. Broadcasting Tables That Expand in Memory
A 200 MB Parquet file can decompress to 1-2 GB in memory. The optimizer sees 200 MB (under the broadcast threshold) and broadcasts it. The driver needs 2-4 GB to hold the deserialized and serialized copies simultaneously.
Fix: Check actual in-memory size with df.cache(); df.count() and check the Storage tab. If the in-memory size is too large, use the MERGE hint to force SortMergeJoin.
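A rough pre-flight check can catch this class of problem before the optimizer does something expensive. Both estimate_in_memory_mb and the 5x expansion factor below are assumptions for illustration — columnar formats commonly expand 2-10x when materialized as JVM objects, and the cache-and-count method above gives the real number:

```python
def estimate_in_memory_mb(parquet_mb, expansion_factor=5):
    """Rough deserialized size of a compressed columnar file.

    expansion_factor is an assumption (2-10x is typical); measure the
    true expansion with df.cache(); df.count() and the Storage tab.
    """
    return parquet_mb * expansion_factor

# The 200 MB file from the example above likely materializes ~1 GB
# in driver memory -- far too large to broadcast safely:
print(estimate_in_memory_mb(200))  # 1000
```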
5. Not Calling unpersist()
# Cache is created
intermediate_df.cache()
intermediate_df.count()
# ... use intermediate_df ...
# Cache is NEVER released — it stays until the application ends
# or is evicted by memory pressure
Fix: Always call intermediate_df.unpersist() when the cached data is no longer needed.
6. Too Many Cores Per Executor
spark.executor.memory = 16g
spark.executor.cores = 8
Per-task unified memory: (16,384 - 300) × 0.6 / 8 = 1,206 MB per task.
If any task needs more than 1.2 GB of execution memory (large shuffle, hash table), it fails.
Fix: Reduce cores per executor or increase memory. A common sweet spot is 4-5 cores per executor with proportional memory.
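The per-task arithmetic generalizes. A sketch of the unified-memory formula used in the example above (per_task_unified_mb is a hypothetical helper; 300 MB is the JVM reserved memory, 0.6 the default spark.memory.fraction):

```python
def per_task_unified_mb(executor_memory_mb, cores,
                        memory_fraction=0.6, reserved_mb=300):
    """Unified (execution + storage) memory available per concurrent task."""
    return (executor_memory_mb - reserved_mb) * memory_fraction / cores

# The 16g / 8-core example above:
print(round(per_task_unified_mb(16 * 1024, 8)))  # 1206
# Dropping to 4 cores doubles the per-task budget:
print(round(per_task_unified_mb(16 * 1024, 4)))  # 2413
```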
7. Too Few Shuffle Partitions
spark.sql.shuffle.partitions = 200 -- default
With 1 TB of data and 200 partitions, each partition is ~5 GB. If execution memory per task is 2 GB, every task spills heavily.
Fix: Increase shuffle partitions. A good starting rule: partition size should be 100-200 MB after shuffle.
spark.sql.shuffle.partitions = 5000 -- for 1 TB data: ~200 MB per partition
Or enable AQE coalescing to let Spark auto-tune:
spark.sql.adaptive.enabled = true
spark.sql.adaptive.coalescePartitions.enabled = true
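The 100-200 MB rule of thumb translates directly into a partition count. A sketch — target_shuffle_partitions is a hypothetical helper:

```python
import math

def target_shuffle_partitions(shuffle_bytes, target_mb=200):
    """Partition count so each post-shuffle partition is ~target_mb."""
    return math.ceil(shuffle_bytes / (target_mb * 1024 * 1024))

one_tb = 1024**4
print(target_shuffle_partitions(one_tb))       # 5243
print(target_shuffle_partitions(one_tb, 128))  # 8192
```

With AQE coalescing enabled, you can set this value generously high and let Spark merge small partitions back together at runtime.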
8. PySpark UDFs Creating Object Pressure
Regular Python UDFs process rows one at a time through Py4J serialization:
from pyspark.sql.functions import udf

@udf("double")
def slow_udf(x):
    return x * 2  # Each row: Python call overhead + serialization
Each row creates serialized objects on both JVM and Python sides.
Fix: Use pandas_udf (vectorized UDFs) that process entire columns:
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def fast_udf(x: pd.Series) -> pd.Series:
    return x * 2  # Processes entire column via Arrow
9. Ignoring Spill in Production
Spill to disk is a safety mechanism, not a normal operating mode. A job that spills 100 GB to disk per stage is running 5-10x slower than it should.
Fix: If you see sustained spill in production, either increase executor memory or increase partition count to reduce per-task data volume.
Memory Optimization Techniques
Right-Sizing Executors
Start with a moderate configuration and scale based on metrics:
# Starting point for most workloads
spark.executor.memory = 8g
spark.executor.cores = 4
spark.executor.memoryOverhead = 2g
# Per-task memory: (8192 - 300) × 0.6 / 4 = 1,183 MB
Monitor spill, GC time, and memory usage. If spill > 0 or GC > 10%, increase memory. If utilization is consistently < 50%, decrease memory.
Kryo Serialization
spark.serializer = org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired = false
Reduces shuffle data size by 2-10x and GC overhead from serialization objects.
AQE Automatic Tuning
spark.sql.adaptive.enabled = true
spark.sql.adaptive.coalescePartitions.enabled = true
spark.sql.adaptive.skewJoin.enabled = true
AQE coalesces small post-shuffle partitions (reducing scheduling overhead) and splits skewed partitions (preventing OOM on straggler tasks).
Filter and Project Early
# Anti-pattern: join full tables, then filter and select
result = orders.join(customers, "customer_id") \
    .filter(col("order_date") >= "2024-01-01") \
    .select("order_id", "customer_name")
# Better: filter and select before join
filtered_orders = orders \
    .filter(col("order_date") >= "2024-01-01") \
    .select("order_id", "customer_id")
filtered_customers = customers.select("customer_id", "customer_name")
result = filtered_orders.join(broadcast(filtered_customers), "customer_id")
Fewer rows and columns entering the join = less shuffle data = less memory pressure = less spill.
Use SortMergeJoin for Memory Safety
ShuffledHashJoin builds a hash table from the build side's local partition. If the partition is larger than available memory, the task OOMs with no recovery.
SortMergeJoin sorts both sides and merges — it spills to disk when memory is exhausted. It is slower but never OOMs on data size alone.
# Force SMJ when memory safety is more important than speed
result = df1.hint("merge").join(df2, "key")
Off-Heap for GC-Bound Workloads
If GC is consistently > 15% of task time despite tuning:
spark.memory.offHeap.enabled = true
spark.memory.offHeap.size = 4g # Add to executor.memoryOverhead too
spark.executor.memoryOverhead = 6g # 2g regular + 4g off-heap
Off-heap data is invisible to the GC, reducing heap pressure.
The Complete Debugging Flowchart
OOM Error
|
├── Where did it happen?
│ ├── Driver log → Driver OOM
│ │ ├── Is there a collect()/toPandas()? → Remove it
│ │ ├── Is there a large broadcast? → Reduce or use MERGE hint
│ │ ├── Large accumulators? → Simplify accumulators
│ │ └── Increase spark.driver.memory (last resort)
│ │
│ ├── Executor log → Executor OOM
│ │ ├── Which error type?
│ │ │ ├── Java heap space → Step 2 below
│ │ │ ├── GC overhead limit → GC tuning + increase memory
│ │ │ ├── Direct buffer → Increase memoryOverhead
│ │ │ ├── Metaspace → Add -XX:MaxMetaspaceSize=512m
│ │ │ └── SparkOutOfMemoryError → Step 2 below
│ │ │
│ │ ├── Step 2: Check Stages tab for failed task
│ │ │ ├── Max >> 10x Median? → Data skew (fix skew, not memory)
│ │ │ ├── All tasks failing? → Partition too large (increase partitions)
│ │ │ └── One task failing? → Skewed key or outlier data
│ │ │
│ │ ├── Step 3: Check spill metrics
│ │ │ ├── No spill + OOM → Hash join can't fit → Use SMJ
│ │ │ ├── Large spill + OOM → Increase partitions or memory
│ │ │ └── Large spill, no OOM → Increase memory for performance
│ │ │
│ │ └── Step 4: Check GC time
│ │ ├── GC > 10% → Enable GC logging, tune G1GC
│ │ ├── GC > 30% → Consider off-heap or increase memory
│ │ └── GC < 5% → Not GC-related
│ │
│ └── Container killed (no stack trace) → Container OOM
│ ├── YARN exit 137 → Increase memoryOverhead
│ ├── K8s OOMKilled → Increase memoryOverheadFactor
│ └── Check if off-heap or PySpark memory is accounted for
│
└── Fix the root cause, not the symptom.
Do NOT just double executor memory.
How Cazpian Eliminates the Guesswork
The debugging workflow described above requires deep Spark expertise, access to Spark UI, knowledge of GC internals, and time to investigate. Most data engineering teams do not have this. They double memory and move on.
Cazpian changes this in three ways:
SQL Editor: Compute Usage After Every Query
When a query completes in the Cazpian SQL editor, the results include compute metrics alongside the data:
- Memory used — execution, storage, and peak unified memory
- Spill metrics — whether data spilled to disk and how much
- Shuffle metrics — bytes shuffled, partitions read/written
- Table metrics — rows scanned, files read, bytes read
Teams see immediately whether their query was efficient or wasteful — before it becomes a production problem. No digging through Spark UI. No guessing.
Job-Level Observability
For batch pipelines, Cazpian provides per-job memory breakdown:
- Which stages consumed the most memory
- Whether any stage spilled
- GC time percentage per stage
- Data skew indicators (max vs median task metrics)
This surfaces the information that the debugging flowchart above requires — automatically, for every job.
AI Studio for Failure Debugging
When a job fails with OOM, Cazpian's AI Studio can analyze the failure:
- Identifies the OOM type (driver, executor, container)
- Pinpoints the failed stage and root cause (skew, oversized broadcast, cache pressure)
- Suggests specific fixes (increase partitions, add salt, reduce broadcast threshold, tune GC)
- Provides the exact configuration changes needed
Instead of a senior engineer spending 30 minutes tracing through logs and Spark UI, AI Studio delivers the diagnosis in seconds. The team fixes the root cause instead of doubling memory.
Configuration Reference
Memory Sizing
| Configuration | Default | Description |
|---|---|---|
| spark.executor.memory | 1g | JVM heap per executor |
| spark.driver.memory | 1g | JVM heap for driver |
| spark.driver.maxResultSize | 1g | Max serialized result per action |
| spark.memory.fraction | 0.6 | Fraction of heap for unified pool |
| spark.memory.storageFraction | 0.5 | Protected storage floor |
| spark.memory.offHeap.enabled | false | Enable off-heap |
| spark.memory.offHeap.size | 0 | Off-heap allocation |
Memory Overhead
| Configuration | Default | Description |
|---|---|---|
| spark.executor.memoryOverhead | max(mem × 0.10, 384 MB) | Executor off-heap overhead |
| spark.driver.memoryOverhead | max(mem × 0.10, 384 MB) | Driver off-heap overhead |
| spark.yarn.executor.memoryOverhead | max(mem × 0.07, 384 MB) | YARN-specific override |
| spark.kubernetes.memoryOverheadFactor | 0.10 (JVM) / 0.40 (non-JVM) | K8s overhead factor |
GC Options
| JVM Flag | Recommended | Description |
|---|---|---|
| -XX:+UseG1GC | Yes | Use G1 garbage collector |
| -XX:InitiatingHeapOccupancyPercent | 35 | Start marking earlier |
| -XX:G1HeapRegionSize | 32m | Reduce humongous allocations |
| -XX:MaxGCPauseMillis | 200 | Target pause time |
| -XX:+UseZGC -XX:+ZGenerational | For low-latency | Sub-ms pauses (Java 21+) |
| -XX:MaxMetaspaceSize | 512m | Prevent Metaspace OOM |
| -XX:MaxDirectMemorySize | 2g | Prevent direct buffer OOM |
Partition Tuning
| Configuration | Default | Description |
|---|---|---|
| spark.sql.shuffle.partitions | 200 | Shuffle partitions (more = less memory per task) |
| spark.sql.files.maxPartitionBytes | 128 MB | Max input partition size |
| spark.sql.adaptive.coalescePartitions.enabled | true | AQE auto-coalesce |
| spark.sql.adaptive.skewJoin.enabled | true | AQE skew handling |
Serialization
| Configuration | Default | Description |
|---|---|---|
| spark.serializer | JavaSerializer | Use KryoSerializer for 2-10x improvement |
| spark.kryoserializer.buffer.max | 64m | Max Kryo buffer |
Summary
OOM errors are symptoms, not root causes. The root cause is always one of: data skew, oversized broadcast, missing unpersist, collect to driver, wrong join strategy, or genuinely insufficient memory for the data volume.
The debugging protocol:
- Driver or executor? Check where the error appeared
- Which error type? Each type has a different cause and fix
- Which stage and task? Find the failed task in Spark UI
- Skew check. Compare max vs median metrics — if max >> 10x median, fix skew first
- Spill check. No spill + OOM means the operation cannot spill (hash join) — change the join strategy
- GC check. If GC > 10% of task time, tune GC before adding memory
- Fix the root cause. Not the symptom
The next time a job OOMs, resist the urge to double memory. Open the Spark UI. Run the flowchart. Fix what actually broke. Your cluster bill will thank you.