Spark OOM Debugging: The Complete Guide to Fixing Out of Memory Errors
The job fails. The log says OutOfMemoryError. Someone doubles spark.executor.memory from 8 GB to 16 GB. The job passes. Nobody asks why. The team moves on, carrying twice the compute cost forever.
Three months later, data volume grows. The same job fails again. They double it to 32 GB. Then 64 GB. Then they hit the maximum instance type and cannot scale further. Only then does someone ask: "What is actually consuming all this memory?"
This is the most expensive pattern in data engineering. Teams treat Spark memory as a single dial and OOM errors as a signal to turn it up. They do not distinguish between driver OOM and executor OOM. They do not check whether the problem is a single skewed partition, an oversized broadcast, a missing unpersist(), or a collect() call buried in a utility function. They do not know how to read GC logs, check spill metrics, or use the Spark UI to pinpoint the memory bottleneck.
This post gives you the complete debugging toolkit. We cover every type of OOM error, how to tell driver from executor, a step-by-step debugging workflow, GC tuning for G1GC and ZGC, memory observability with Prometheus and Grafana, the most common anti-patterns, and how Cazpian eliminates the guesswork.
Every Type of OOM Error
Not all OOM errors are the same. Each one has a different root cause, appears in different logs, and requires a different fix. The first step in debugging is identifying which type you have.
1. Java Heap Space
java.lang.OutOfMemoryError: Java heap space
What it means: The JVM heap is exhausted. The garbage collector cannot free enough memory to satisfy a new allocation request.
Where it appears: Driver logs or executor logs (check which one).
Common causes:
- Executor: Large shuffle partitions that do not fit in execution memory, oversized hash tables during joins or aggregations, too many concurrent tasks sharing limited heap
- Driver: collect() on a large DataFrame, building a large broadcast variable, accumulating too many results
Key diagnostic: Check whether this appears in the driver log or executor log. If executor, check which stage and task failed in the Spark UI Stages tab.
2. GC Overhead Limit Exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
What it means: The JVM is spending more than 98% of its time in garbage collection and recovering less than 2% of the heap with each collection cycle. The JVM gives up.
Where it appears: Driver or executor logs.
Common causes:
- Heap is nearly full and the GC is churning — repeatedly scanning the same long-lived objects
- Too many small objects (e.g., Java Strings from deserialization, UDF intermediate results)
- Cached data occupying Old Gen, leaving too little space for execution allocations
Key diagnostic: This is often worse than a simple heap space error — it means the application was effectively frozen doing GC for an extended period before the error was thrown. Enable GC logging to see the pattern.
3. Direct Buffer Memory
java.lang.OutOfMemoryError: Direct buffer memory
What it means: NIO direct byte buffers (allocated outside the JVM heap, in native memory) are exhausted.
Where it appears: Executor logs, usually during shuffle read/write.
Common causes:
- Shuffle operations use Netty for network transfers, which allocates direct byte buffers
- spark.reducer.maxSizeInFlight (default 48 MB) controls how many shuffle blocks are fetched concurrently — too high with many concurrent tasks can exhaust direct memory
- Insufficient -XX:MaxDirectMemorySize JVM option
Fix:
spark.executor.extraJavaOptions = -XX:MaxDirectMemorySize=2g
Or increase spark.executor.memoryOverhead to give the process more non-heap room.
4. Metaspace
java.lang.OutOfMemoryError: Metaspace
What it means: The JVM Metaspace (which holds class metadata, method bytecodes, and constant pools) is full.
Where it appears: Driver or executor logs.
Common causes:
- Applications that dynamically generate classes (some serialization frameworks, code generation)
- Large numbers of UDFs each creating new class definitions
- Spark's whole-stage code generation creating many compiled classes for complex queries
Fix:
spark.executor.extraJavaOptions = -XX:MaxMetaspaceSize=512m
spark.driver.extraJavaOptions = -XX:MaxMetaspaceSize=512m
Default Metaspace is unlimited in size but may be restricted by container memory limits.
5. Container Killed by YARN
Container killed by YARN for exceeding memory limits.
X GB of Y GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.
What it means: The total physical memory (RSS) of the executor process exceeded the YARN container limit. YARN killed the container. Exit code: 137.
Where it appears: YARN NodeManager logs and Spark driver logs.
Common causes:
- Off-heap memory (NIO direct buffers, native memory) exceeded spark.yarn.executor.memoryOverhead
- PySpark Python worker memory not accounted for
- Off-heap Spark memory enabled but not added to container calculation
- JVM internal overhead (thread stacks, Metaspace, JIT compiled code) exceeded the overhead budget
Key characteristic: There is no Java stack trace. The process was killed externally. This makes it the hardest OOM error to diagnose.
Fix: Increase spark.yarn.executor.memoryOverhead or spark.executor.memoryOverhead.
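Before resizing anything, it helps to know what YARN is actually enforcing: the container limit is heap plus overhead. A minimal sketch of that arithmetic, assuming the documented default overhead formula max(10% of heap, 384 MB); yarn_container_mb is a hypothetical helper name:

```python
def yarn_container_mb(executor_memory_mb, overhead_mb=None):
    """Total physical memory YARN reserves for one executor container.

    Overhead defaults to max(10% of heap, 384 MB), matching the
    documented default of spark.executor.memoryOverhead.
    """
    if overhead_mb is None:
        overhead_mb = max(int(executor_memory_mb * 0.10), 384)
    return executor_memory_mb + overhead_mb

# An 8 GB heap actually requests ~8.8 GB from YARN:
print(yarn_container_mb(8192))  # 9011
# Small heaps hit the 384 MB floor:
print(yarn_container_mb(1024))  # 1408
```

If the process's RSS crosses this total — not just the heap — YARN kills it, which is why raising spark.executor.memory alone often does not fix exit code 137.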
6. Kubernetes OOMKilled
Reason: OOMKilled
Exit Code: 137
What it means: The pod's memory usage exceeded the Kubernetes memory limit. The cgroup OOM killer terminated the process.
Where it appears: kubectl describe pod <pod-name> output.
Common causes: Same as YARN container kills — off-heap memory exceeding the overhead budget.
Fix: Increase spark.kubernetes.memoryOverheadFactor or spark.executor.memoryOverhead.
7. SparkOutOfMemoryError
org.apache.spark.memory.SparkOutOfMemoryError:
Unable to acquire X bytes of memory, got 0
What it means: Spark's internal memory manager (not the JVM) could not allocate memory for a task. The unified execution memory pool is exhausted.
Where it appears: Executor logs, in a specific task.
Common causes:
- A single task is trying to build a hash table (ShuffledHashJoin, BroadcastHashJoin) that is larger than available execution memory
- Skewed partition — one partition has far more data than others
- Too many concurrent tasks competing for the shared execution memory pool
Key diagnostic: Check the Stages tab for the failed task. Look at input size, shuffle read size, and spill metrics. Compare the failed task's metrics with the median — if max >> median, you have data skew.
Fix: Increase partitions (spark.sql.shuffle.partitions), fix data skew, or increase executor memory.
Step 1: Driver or Executor?
The most critical first question. The answer determines your entire debugging path.
How to Tell
Check the Spark UI Executors tab:
- If the driver row shows failure or high memory → driver OOM
- If an executor row shows failure → executor OOM
Check the logs:
- Driver OOM: the error appears in the application master log or the client-mode driver log. The application itself terminates
- Executor OOM: the error appears in an executor's stderr log. The executor is lost, and tasks are retried on other executors (unless max retries are exhausted)
Check the behavior:
- Driver OOM: the entire application crashes immediately
- Executor OOM: individual tasks fail and are retried. The application may survive if enough retries succeed
Driver OOM Debugging Path
If the OOM is on the driver:
- Check for collect(). Search your code for .collect(), .toPandas(), .take(large_number), .show(large_number). These pull data to the driver
- Check broadcast sizes. Look at the SQL tab for BroadcastExchange nodes. Check the data size — if it is hundreds of MB, the driver needs to hold that in memory
- Check accumulators. Custom accumulators that store large values (lists, maps) accumulate on the driver
- Check maxResultSize. spark.driver.maxResultSize (default 1 GB) limits the total result size. If the job is collecting more than this, it fails with a different error — but if it is set too high, the actual collection can OOM the driver
Fix: Increase spark.driver.memory, or (better) eliminate the data collection. Replace collect() with .write(), replace toPandas() with pandas_udf processing on executors.
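When a collect() genuinely cannot be avoided, estimate the result size before running it. A back-of-envelope sketch — estimated_collect_mb is a hypothetical helper, and avg_row_bytes is an assumption you should measure from a small sample:

```python
def estimated_collect_mb(row_count, avg_row_bytes):
    """Rough size of a collect() result landing on the driver.

    avg_row_bytes must be measured (e.g. serialize a 1,000-row sample);
    compare the result against spark.driver.maxResultSize and
    spark.driver.memory before calling collect().
    """
    return row_count * avg_row_bytes / (1024 * 1024)

# 50M rows at ~200 bytes each is ~9.5 GB -- far past the 1 GB
# maxResultSize default and enough to OOM most drivers:
print(round(estimated_collect_mb(50_000_000, 200)))  # 9537
```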
Executor OOM Debugging Path
If the OOM is on an executor:
Continue to Step 2 below.
Step 2: Which Stage and Task Failed?
Open the Spark UI → Stages tab. Find the failed stage. Click into it.
What to Look For
Failed Tasks table: Shows which tasks failed and why. The error message tells you the OOM type.
Task Metrics to compare:
| Metric | What It Tells You |
|---|---|
| Input Size | How much data the task read from source |
| Shuffle Read | How much data arrived via shuffle |
| Shuffle Write | How much data was written for the next stage |
| Spill (Memory) | Data that was in memory before spilling |
| Spill (Disk) | Data written to disk because memory was full |
| GC Time | Time spent in garbage collection |
| Duration | Total task time |
The Skew Check
Compare the failed task's metrics with the median across all tasks in the stage:
If max(Shuffle Read) >> 10x median(Shuffle Read) → data skew
If max(Input Size) >> 10x median(Input Size) → partition skew
If max(Duration) >> 10x median(Duration) → straggler task
If you see skew, the fix is not more memory — it is fixing the skew. See our Spark Data Skew guide for techniques (AQE skew join, salting, isolate-and-broadcast).
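The max-vs-median test is easy to automate against per-task metrics exported from the UI or REST API. A sketch — find_skewed_tasks is a hypothetical helper implementing the 10x-median rule above:

```python
import statistics

def find_skewed_tasks(shuffle_read_bytes, factor=10):
    """Return indices of tasks whose shuffle read exceeds
    `factor` x the median -- the skew test described above."""
    med = statistics.median(shuffle_read_bytes)
    return [i for i, b in enumerate(shuffle_read_bytes)
            if b > factor * med]

# 199 even partitions plus one hot partition:
reads = [100_000_000] * 199 + [4_000_000_000]
print(find_skewed_tasks(reads))  # [199]
```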
Step 3: Check Spill Metrics
Spill metrics in the Stages tab reveal whether the executor was under memory pressure.
Spill (Memory): The size of deserialized data that was in memory before being spilled.
Spill (Disk): The size of serialized data written to disk during the spill.
What Spill Tells You
- No spill, but OOM: The operation cannot spill — likely a hash table build (ShuffledHashJoin, BroadcastHashJoin) that must fit entirely in memory. Fix: use SortMergeJoin instead (it spills gracefully)
- Large spill, no OOM: The operation is spilling successfully but performance is degraded. Fix: increase executor memory or reduce partition sizes
- Spill + OOM: The spill mechanism was overwhelmed — too much data for even disk-backed processing. Fix: increase partitions to reduce per-task data volume
Spill Ratio
Spill Ratio = Spill (Disk) / Spill (Memory)
- Ratio ~0.3-0.5: Normal (serialized data is smaller than deserialized due to compression)
- Ratio near 1.0: Poor compression — data may not benefit from columnar format
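The same ratio as a sketch, using the two spill metrics from the Stages tab (spill_ratio is a hypothetical helper):

```python
def spill_ratio(disk_bytes, memory_bytes):
    """Spill (Disk) / Spill (Memory) from the Stages tab."""
    if memory_bytes == 0:
        return None  # no spill occurred
    return disk_bytes / memory_bytes

# 30 GB deserialized in memory spilled as 12 GB serialized on disk:
r = spill_ratio(12 * 2**30, 30 * 2**30)
print(round(r, 2))  # 0.4 -- within the normal 0.3-0.5 band
```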
Step 4: Check GC Time
In the Stages tab, expand the Summary Metrics section. Check the GC Time column.
What GC Time Tells You
- GC Time < 5% of task time: Normal. Not a memory problem
- GC Time 5-10% of task time: Borderline. Monitor but may be acceptable
- GC Time > 10% of task time: Memory pressure. The GC is struggling
- GC Time > 30% of task time: Severe. Likely approaching GC overhead limit
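These thresholds turn directly into a triage function. A sketch — gc_pressure is a hypothetical helper encoding the buckets above:

```python
def gc_pressure(gc_time_ms, task_time_ms):
    """Classify a task's GC time using the thresholds above."""
    pct = 100.0 * gc_time_ms / task_time_ms
    if pct < 5:
        return "normal"
    if pct <= 10:
        return "borderline"
    if pct <= 30:
        return "memory pressure"
    return "severe"

print(gc_pressure(2_000, 60_000))   # normal (~3.3% of task time)
print(gc_pressure(25_000, 60_000))  # severe (~41.7% of task time)
```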
GC Time Patterns
High GC on all tasks: The executor heap is too small for the workload. Increase spark.executor.memory or reduce spark.executor.cores (fewer concurrent tasks = less memory pressure per task).
High GC on a few tasks: Data skew. Some partitions are much larger than others. The skewed tasks create more objects, causing more GC.
GC time increases over the stage lifetime: Memory leak or cache accumulation. Objects that should be short-lived are surviving into Old Gen.
Step 5: Enable GC Logging
For detailed GC analysis, enable GC logging:
Java 11+ (recommended):
spark.executor.extraJavaOptions = -Xlog:gc*:file=/tmp/gc-executor.log:time,uptime,level,tags
spark.driver.extraJavaOptions = -Xlog:gc*:file=/tmp/gc-driver.log:time,uptime,level,tags
Java 8:
spark.executor.extraJavaOptions = -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/gc-executor.log
What to Look For in GC Logs
Full GC events: These are stop-the-world pauses that freeze all threads. Frequent full GCs indicate severe memory pressure.
[Full GC (Allocation Failure) 14G->12G(16G), 8.234 secs]
This says: Full GC freed only 2 GB from a 16 GB heap, and it took 8 seconds. The heap is 75% full of long-lived objects (cached data, broadcast variables).
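Pulling these numbers out of a log programmatically is straightforward. A sketch that parses the Java 8-style Full GC line above — parse_full_gc is a hypothetical helper, and the regex assumes sizes reported in whole gigabytes; adjust it for your JVM's actual log format:

```python
import re

# Matches lines like: [Full GC (Allocation Failure) 14G->12G(16G), 8.234 secs]
FULL_GC = re.compile(r"Full GC .*?(\d+)G->(\d+)G\((\d+)G\), ([\d.]+) secs")

def parse_full_gc(line):
    """Extract before/after/total heap and pause time from a Full GC line."""
    m = FULL_GC.search(line)
    if not m:
        return None
    before, after, total, secs = m.groups()
    return {"freed_gb": int(before) - int(after),
            "live_gb": int(after),
            "heap_gb": int(total),
            "pause_s": float(secs)}

print(parse_full_gc("[Full GC (Allocation Failure) 14G->12G(16G), 8.234 secs]"))
# {'freed_gb': 2, 'live_gb': 12, 'heap_gb': 16, 'pause_s': 8.234}
```

A rising live_gb across successive Full GC lines is the signature of cache accumulation or a leak.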
Humongous allocations (G1GC): Objects larger than half a G1 region are allocated directly in Old Gen, bypassing Young Gen. This triggers more Full GCs.
[G1Ergonomics ... Humongous Allocation, size: 33554448]
Fix: Increase G1HeapRegionSize so fewer objects are classified as humongous.
To-space exhausted (G1GC): G1 cannot find enough free regions to copy surviving objects during evacuation. This triggers a Full GC.
[GC pause (G1 Evacuation Pause) (to-space exhausted)]
Fix: Increase heap size or reduce the live data set.
GC Tuning for Spark
G1GC (Recommended for Most Workloads)
G1GC is the default garbage collector for Spark on Java 11+. It divides the heap into equal-sized regions and collects regions with the most garbage first.
Recommended configuration:
spark.executor.extraJavaOptions = -XX:+UseG1GC \
-XX:InitiatingHeapOccupancyPercent=35 \
-XX:G1HeapRegionSize=32m \
-XX:MaxGCPauseMillis=200 \
-XX:ParallelGCThreads=8 \
-XX:ConcGCThreads=2
Key parameters:
| Parameter | Default | Recommended | Why |
|---|---|---|---|
| InitiatingHeapOccupancyPercent | 45 | 35 | Start concurrent marking earlier to avoid Full GC |
| G1HeapRegionSize | auto | 32m (heaps >16 GB) | Reduce humongous allocations for Spark's large buffers |
| MaxGCPauseMillis | 200 | 200 | Target pause time (G1 adjusts collection to meet this) |
| ParallelGCThreads | auto | 8 | Parallel GC threads (for STW phases) |
| ConcGCThreads | auto | 2 | Concurrent marking threads |
ZGC (For Ultra-Low Latency)
ZGC provides sub-millisecond pause times regardless of heap size. Available on Java 21+ with generational mode.
spark.executor.extraJavaOptions = -XX:+UseZGC -XX:+ZGenerational
When to use ZGC:
- Heap sizes > 64 GB where G1GC pauses become long
- Workloads that are latency-sensitive (interactive queries)
- When GC tuning of G1GC is not achieving acceptable pause times
Tradeoff: ZGC uses more CPU for concurrent collection and more memory for colored pointers (additional metadata per object reference). Do not use ZGC if CPU is the bottleneck.
How Cached Data Affects GC
Cached DataFrames and broadcast variables are long-lived objects that survive into the Old Gen. The GC must scan them every Full GC cycle but can never collect them (they are still referenced).
Impact:
- More cached data = larger live set = longer GC pauses
- G1GC's Old Gen fills up faster, triggering concurrent marking earlier
- If the live set (cached data + active execution) approaches heap size, the GC thrashes
Fix: Cache only what is reused. Call unpersist() when done. Monitor the Storage tab to see how much is cached. Use MEMORY_AND_DISK_SER instead of MEMORY_ONLY to reduce the in-heap footprint.
Kryo vs Java Serialization
Spark's default JavaSerializer creates verbose serialized representations with many small Java objects. Switching to Kryo reduces both serialized size and GC pressure.
spark.serializer = org.apache.spark.serializer.KryoSerializer
Impact:
- Shuffle data is 2-10x more compact → less network transfer, less memory
- Fewer Java objects created during deserialization → less GC work
- Broadcast variables are smaller → less driver memory, faster distribution
Memory Observability
Spark UI Executors Tab
The Executors tab shows real-time memory usage per executor:
| Column | What It Shows |
|---|---|
| Storage Memory | used / total — how much cached data each executor holds |
| On Heap | JVM heap memory usage (unified + user) |
| Off Heap | Off-heap memory usage (if enabled) |
| Disk Used | Spilled or MEMORY_AND_DISK cached data on local disk |
What to look for:
- All executors at 90%+ storage → over-caching
- One executor at 95%, rest at 30% → data skew in cached partitions
- Large "Disk Used" → memory pressure causing spill
Spark UI Storage Tab
Shows every cached RDD/DataFrame:
| Column | What It Shows |
|---|---|
| RDD Name | The cached dataset name |
| Storage Level | MEMORY_ONLY, MEMORY_AND_DISK, etc. |
| Cached Partitions | Number of partitions in cache |
| Size in Memory | Memory consumed |
| Size on Disk | Disk consumed (for spilled partitions) |
What to look for:
- Datasets with 0% cached → evicted by execution memory pressure
- Very large cached datasets that are not being reused → wasted memory
- Mismatched storage levels → some data MEMORY_ONLY when it should be MEMORY_AND_DISK
Spark UI Stages Tab — Summary Metrics
Expand the summary metrics for any stage to see:
- Shuffle Read Size — distribution across tasks (min, 25th, median, 75th, max)
- Shuffle Write Size — same distribution
- Spill (Memory / Disk) — which tasks are spilling
- GC Time — GC distribution across tasks
- Peak Execution Memory — maximum execution memory per task
The skew test: If max >> 10x median for any metric, you have skew.
Spark UI Environment Tab
Shows all configured memory parameters:
- spark.executor.memory, spark.driver.memory
- spark.memory.fraction, spark.memory.storageFraction
- spark.executor.memoryOverhead
- All JVM options (GC flags, off-heap settings)
Use this to verify your configuration is actually applied.
REST API
Programmatic access to executor metrics:
GET /api/v1/applications/{appId}/executors
Returns JSON with memory metrics per executor — useful for automated monitoring.
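For example, once the JSON is fetched and parsed, a few lines can flag executors running near their storage limit. The field names (id, memoryUsed, maxMemory) follow Spark's ExecutorSummary schema; flag_hot_executors is a hypothetical helper:

```python
def flag_hot_executors(executors, threshold=0.85):
    """Given the parsed JSON list from
    /api/v1/applications/{appId}/executors, return ids of executors
    whose storage memory exceeds `threshold` of the maximum."""
    return [e["id"] for e in executors
            if e["maxMemory"] and e["memoryUsed"] / e["maxMemory"] > threshold]

sample = [
    {"id": "driver", "memoryUsed": 1 * 2**30, "maxMemory": 4 * 2**30},
    {"id": "1", "memoryUsed": int(3.8 * 2**30), "maxMemory": 4 * 2**30},
]
print(flag_hot_executors(sample))  # ['1']
```

Polling this on a schedule gives you an early-warning signal long before an OOM appears in the logs.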
Prometheus + Grafana
For production monitoring, export Spark metrics to Prometheus:
spark.ui.prometheus.enabled = true
This exposes a /metrics/executors/prometheus endpoint that Prometheus can scrape.
Key metrics to dashboard:
| Metric | Alert Threshold |
|---|---|
| jvm.heap.used / jvm.heap.max | > 85% sustained |
| executor.GCTime / executor.runTime | > 10% |
| executor.diskBytesSpilled | > 0 (sustained) |
| executor.memoryBytesSpilled | > 0 (sustained) |
| executor.failedTasks | > 0 |
JMX metrics sink:
# metrics.properties
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
Enables JMX exposure of all Spark metrics for tools like VisualVM, JConsole, or JMX-to-Prometheus exporters.
Grafana Dashboard Setup
- Configure Spark to export to Prometheus (spark.ui.prometheus.enabled = true)
- Configure Prometheus to scrape the Spark master/executor endpoints
- Import a pre-built Spark dashboard (Grafana Dashboard ID 7890) or build custom panels
- Key panels: heap usage over time, GC pause duration, spill bytes, executor memory distribution
Memory Anti-Patterns
1. Doubling Memory Without Diagnosis
The most expensive anti-pattern. A 16 GB executor OOM does not mean you need 32 GB — it often means one partition has 10x the data of others, and salting would fix it at 16 GB.
Fix: Always run the debugging workflow (Steps 1-5) before changing memory.
2. collect() on Large DataFrames
# This pulls ALL data to driver — guaranteed OOM for large DataFrames
all_rows = df.collect()
# This too — creates a Pandas DataFrame in driver memory
pandas_df = df.toPandas()
Fix: Process data on executors. Write results to storage. Use .take(N) for samples. Use pandas_udf for pandas operations on executors.
3. Caching Too Much Data
# Caching a 500 GB table on a cluster with 100 GB total memory
large_table.cache()
large_table.count() # Triggers caching
# Result: eviction thrash, GC storms, performance worse than no cache
Fix: Cache only frequently reused, small-to-medium datasets. Monitor the Storage tab. Call unpersist() when done.
4. Broadcasting Tables That Expand in Memory
A 200 MB Parquet file can decompress to 1-2 GB in memory. The optimizer sees 200 MB (under the broadcast threshold) and broadcasts it. The driver needs 2-4 GB to hold the deserialized and serialized copies simultaneously.
Fix: Check actual in-memory size with df.cache(); df.count() and check the Storage tab. If the in-memory size is too large, use the MERGE hint to force SortMergeJoin.
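A rough pre-flight check can catch this class of problem before the optimizer does something expensive. Both estimate_in_memory_mb and the 5x expansion factor below are assumptions for illustration — columnar formats commonly expand 2-10x when materialized as JVM objects, and the cache-and-count method above gives the real number:

```python
def estimate_in_memory_mb(parquet_mb, expansion_factor=5):
    """Rough deserialized size of a compressed columnar file.

    expansion_factor is an assumption (2-10x is typical); measure the
    true expansion with df.cache(); df.count() and the Storage tab.
    """
    return parquet_mb * expansion_factor

# The 200 MB file from the example above likely materializes ~1 GB
# in driver memory -- far too large to broadcast safely:
print(estimate_in_memory_mb(200))  # 1000
```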
5. Not Calling unpersist()
# Cache is created
intermediate_df.cache()
intermediate_df.count()
# ... use intermediate_df ...
# Cache is NEVER released — it stays until the application ends
# or is evicted by memory pressure
Fix: Always call intermediate_df.unpersist() when the cached data is no longer needed.
6. Too Many Cores Per Executor
spark.executor.memory = 16g
spark.executor.cores = 8
Per-task unified memory: (16,384 - 300) × 0.6 / 8 = 1,206 MB per task.
If any task needs more than 1.2 GB of execution memory (large shuffle, hash table), it fails.
Fix: Reduce cores per executor or increase memory. A common sweet spot is 4-5 cores per executor with proportional memory.
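The per-task arithmetic generalizes. A sketch of the unified-memory formula used in the example above (per_task_unified_mb is a hypothetical helper; 300 MB is the JVM reserved memory, 0.6 the default spark.memory.fraction):

```python
def per_task_unified_mb(executor_memory_mb, cores,
                        memory_fraction=0.6, reserved_mb=300):
    """Unified (execution + storage) memory available per concurrent task."""
    return (executor_memory_mb - reserved_mb) * memory_fraction / cores

# The 16g / 8-core example above:
print(round(per_task_unified_mb(16 * 1024, 8)))  # 1206
# Dropping to 4 cores doubles the per-task budget:
print(round(per_task_unified_mb(16 * 1024, 4)))  # 2413
```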
7. Too Few Shuffle Partitions
spark.sql.shuffle.partitions = 200 -- default
With 1 TB of data and 200 partitions, each partition is ~5 GB. If execution memory per task is 2 GB, every task spills heavily.
Fix: Increase shuffle partitions. A good starting rule: partition size should be 100-200 MB after shuffle.
spark.sql.shuffle.partitions = 5000 -- for 1 TB data: ~200 MB per partition
Or enable AQE coalescing to let Spark auto-tune:
spark.sql.adaptive.enabled = true
spark.sql.adaptive.coalescePartitions.enabled = true
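The 100-200 MB rule of thumb translates directly into a partition count. A sketch — target_shuffle_partitions is a hypothetical helper:

```python
import math

def target_shuffle_partitions(shuffle_bytes, target_mb=200):
    """Partition count so each post-shuffle partition is ~target_mb."""
    return math.ceil(shuffle_bytes / (target_mb * 1024 * 1024))

one_tb = 1024**4
print(target_shuffle_partitions(one_tb))       # 5243
print(target_shuffle_partitions(one_tb, 128))  # 8192
```

With AQE coalescing enabled, you can set this value generously high and let Spark merge small partitions back together at runtime.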
8. PySpark UDFs Creating Object Pressure
Regular Python UDFs process rows one at a time through Py4J serialization:
from pyspark.sql.functions import udf

@udf("double")
def slow_udf(x):
    return x * 2  # Each row: Python call overhead + serialization
Each row creates serialized objects on both JVM and Python sides.
Fix: Use pandas_udf (vectorized UDFs) that process entire columns:
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def fast_udf(x: pd.Series) -> pd.Series:
    return x * 2  # Processes entire column via Arrow
9. Ignoring Spill in Production
Spill to disk is a safety mechanism, not a normal operating mode. A job that spills 100 GB to disk per stage is running 5-10x slower than it should.
Fix: If you see sustained spill in production, either increase executor memory or increase partition count to reduce per-task data volume.
Memory Optimization Techniques
Right-Sizing Executors
Start with a moderate configuration and scale based on metrics:
# Starting point for most workloads
spark.executor.memory = 8g
spark.executor.cores = 4
spark.executor.memoryOverhead = 2g
# Per-task memory: (8192 - 300) × 0.6 / 4 = 1,183 MB
Monitor spill, GC time, and memory usage. If spill > 0 or GC > 10%, increase memory. If utilization is consistently < 50%, decrease memory.
Kryo Serialization
spark.serializer = org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired = false
Reduces shuffle data size by 2-10x and GC overhead from serialization objects.
AQE Automatic Tuning
spark.sql.adaptive.enabled = true
spark.sql.adaptive.coalescePartitions.enabled = true
spark.sql.adaptive.skewJoin.enabled = true
AQE coalesces small post-shuffle partitions (reducing scheduling overhead) and splits skewed partitions (preventing OOM on straggler tasks).
Filter and Project Early
# Anti-pattern: join full tables, then filter and select
result = orders.join(customers, "customer_id") \
    .filter(col("order_date") >= "2024-01-01") \
    .select("order_id", "customer_name")
# Better: filter and select before join
filtered_orders = orders \
    .filter(col("order_date") >= "2024-01-01") \
    .select("order_id", "customer_id")
filtered_customers = customers.select("customer_id", "customer_name")
result = filtered_orders.join(broadcast(filtered_customers), "customer_id")
Fewer rows and columns entering the join = less shuffle data = less memory pressure = less spill.
Use SortMergeJoin for Memory Safety
ShuffledHashJoin builds a hash table from the build side's local partition. If the partition is larger than available memory, the task OOMs with no recovery.
SortMergeJoin sorts both sides and merges — it spills to disk when memory is exhausted. It is slower but never OOMs on data size alone.
# Force SMJ when memory safety is more important than speed
result = df1.hint("merge").join(df2, "key")
Off-Heap for GC-Bound Workloads
If GC is consistently > 15% of task time despite tuning:
spark.memory.offHeap.enabled = true
spark.memory.offHeap.size = 4g # Add to executor.memoryOverhead too
spark.executor.memoryOverhead = 6g # 2g regular + 4g off-heap
Off-heap data is invisible to the GC, reducing heap pressure.
The Complete Debugging Flowchart
OOM Error
|
├── Where did it happen?
│ ├── Driver log → Driver OOM
│ │ ├── Is there a collect()/toPandas()? → Remove it
│ │ ├── Is there a large broadcast? → Reduce or use MERGE hint
│ │ ├── Large accumulators? → Simplify accumulators
│ │ └── Increase spark.driver.memory (last resort)
│ │
│ ├── Executor log → Executor OOM
│ │ ├── Which error type?
│ │ │ ├── Java heap space → Step 2 below
│ │ │ ├── GC overhead limit → GC tuning + increase memory
│ │ │ ├── Direct buffer → Increase memoryOverhead
│ │ │ ├── Metaspace → Add -XX:MaxMetaspaceSize=512m
│ │ │ └── SparkOutOfMemoryError → Step 2 below
│ │ │
│ │ ├── Step 2: Check Stages tab for failed task
│ │ │ ├── Max >> 10x Median? → Data skew (fix skew, not memory)
│ │ │ ├── All tasks failing? → Partition too large (increase partitions)
│ │ │ └── One task failing? → Skewed key or outlier data
│ │ │
│ │ ├── Step 3: Check spill metrics
│ │ │ ├── No spill + OOM → Hash join can't fit → Use SMJ
│ │ │ ├── Large spill + OOM → Increase partitions or memory
│ │ │ └── Large spill, no OOM → Increase memory for performance
│ │ │
│ │ └── Step 4: Check GC time
│ │ ├── GC > 10% → Enable GC logging, tune G1GC
│ │ ├── GC > 30% → Consider off-heap or increase memory
│ │ └── GC < 5% → Not GC-related
│ │
│ └── Container killed (no stack trace) → Container OOM
│ ├── YARN exit 137 → Increase memoryOverhead
│ ├── K8s OOMKilled → Increase memoryOverheadFactor
│ └── Check if off-heap or PySpark memory is accounted for
│
└── Fix the root cause, not the symptom.
Do NOT just double executor memory.
How Cazpian Eliminates the Guesswork
The debugging workflow described above requires deep Spark expertise, access to Spark UI, knowledge of GC internals, and time to investigate. Most data engineering teams do not have this. They double memory and move on.
Cazpian changes this in three ways:
SQL Editor: Compute Usage After Every Query
When a query completes in the Cazpian SQL editor, the results include compute metrics alongside the data:
- Memory used — execution, storage, and peak unified memory
- Spill metrics — whether data spilled to disk and how much
- Shuffle metrics — bytes shuffled, partitions read/written
- Table metrics — rows scanned, files read, bytes read
Teams see immediately whether their query was efficient or wasteful — before it becomes a production problem. No digging through Spark UI. No guessing.
Job-Level Observability
For batch pipelines, Cazpian provides per-job memory breakdown:
- Which stages consumed the most memory
- Whether any stage spilled
- GC time percentage per stage
- Data skew indicators (max vs median task metrics)
This surfaces the information that the debugging flowchart above requires — automatically, for every job.
AI Studio for Failure Debugging
When a job fails with OOM, Cazpian's AI Studio can analyze the failure:
- Identifies the OOM type (driver, executor, container)
- Pinpoints the failed stage and root cause (skew, oversized broadcast, cache pressure)
- Suggests specific fixes (increase partitions, add salt, reduce broadcast threshold, tune GC)
- Provides the exact configuration changes needed
Instead of a senior engineer spending 30 minutes tracing through logs and Spark UI, AI Studio delivers the diagnosis in seconds. The team fixes the root cause instead of doubling memory.
Configuration Reference
Memory Sizing
| Configuration | Default | Description |
|---|---|---|
| spark.executor.memory | 1g | JVM heap per executor |
| spark.driver.memory | 1g | JVM heap for driver |
| spark.driver.maxResultSize | 1g | Max serialized result per action |
| spark.memory.fraction | 0.6 | Fraction of heap for unified pool |
| spark.memory.storageFraction | 0.5 | Protected storage floor |
| spark.memory.offHeap.enabled | false | Enable off-heap |
| spark.memory.offHeap.size | 0 | Off-heap allocation |
Memory Overhead
| Configuration | Default | Description |
|---|---|---|
| spark.executor.memoryOverhead | max(mem × 0.10, 384 MB) | Executor off-heap overhead |
| spark.driver.memoryOverhead | max(mem × 0.10, 384 MB) | Driver off-heap overhead |
| spark.yarn.executor.memoryOverhead | max(mem × 0.07, 384 MB) | YARN-specific override |
| spark.kubernetes.memoryOverheadFactor | 0.10 (JVM) / 0.40 (non-JVM) | K8s overhead factor |
GC Options
| JVM Flag | Recommended | Description |
|---|---|---|
| -XX:+UseG1GC | Yes | Use G1 garbage collector |
| -XX:InitiatingHeapOccupancyPercent | 35 | Start marking earlier |
| -XX:G1HeapRegionSize | 32m | Reduce humongous allocations |
| -XX:MaxGCPauseMillis | 200 | Target pause time |
| -XX:+UseZGC -XX:+ZGenerational | For low-latency | Sub-ms pauses (Java 21+) |
| -XX:MaxMetaspaceSize | 512m | Prevent Metaspace OOM |
| -XX:MaxDirectMemorySize | 2g | Prevent direct buffer OOM |
Partition Tuning
| Configuration | Default | Description |
|---|---|---|
| spark.sql.shuffle.partitions | 200 | Shuffle partitions (more = less memory per task) |
| spark.sql.files.maxPartitionBytes | 128 MB | Max input partition size |
| spark.sql.adaptive.coalescePartitions.enabled | true | AQE auto-coalesce |
| spark.sql.adaptive.skewJoin.enabled | true | AQE skew handling |
Serialization
| Configuration | Default | Description |
|---|---|---|
| spark.serializer | JavaSerializer | Use KryoSerializer for 2-10x improvement |
| spark.kryoserializer.buffer.max | 64m | Max Kryo buffer |
Summary
OOM errors are symptoms, not root causes. The root cause is always one of: data skew, oversized broadcast, missing unpersist, collect to driver, wrong join strategy, or genuinely insufficient memory for the data volume.
The debugging protocol:
- Driver or executor? Check where the error appeared
- Which error type? Each type has a different cause and fix
- Which stage and task? Find the failed task in Spark UI
- Skew check. Compare max vs median metrics — if max >> 10x median, fix skew first
- Spill check. No spill + OOM means the operation cannot spill (hash join) — change the join strategy
- GC check. If GC > 10% of task time, tune GC before adding memory
- Fix the root cause. Not the symptom
The next time a job OOMs, resist the urge to double memory. Open the Spark UI. Run the flowchart. Fix what actually broke. Your cluster bill will thank you.