
Spark Memory Architecture: The Complete Guide to the Unified Memory Model

17 min read
Cazpian Engineering
Platform Engineering Team

A Spark job fails with OutOfMemoryError. The team doubles spark.executor.memory from 8 GB to 16 GB. The job passes. Nobody investigates why. Three months later, the same job fails again on a larger dataset. They double it again to 32 GB. The cluster bill doubles with it.

This is the most common pattern in Spark operations — and the most expensive. Teams treat memory as a single knob to turn up. They do not know that Spark splits memory into distinct regions with different purposes, different eviction rules, and different failure modes. They do not know that their 16 GB executor gives Spark only 9.65 GB of usable unified memory. They do not know that their driver OOM has nothing to do with executor memory. They do not know how much of their container memory is overhead they never configured.

This post explains exactly how Spark manages memory. We cover the unified memory model with exact formulas, the difference between execution and storage memory, driver vs executor memory architecture, off-heap memory with Tungsten, container memory for YARN and Kubernetes, PySpark-specific memory, and how to calculate every region from your configuration. The companion post covers OOM debugging, GC tuning, and observability.

The Unified Memory Model

Before Spark 1.6, memory management used a static model with a fixed boundary between execution and storage. If execution memory was full while storage sat mostly empty, execution could not use the idle storage memory — it simply spilled to disk. This was wasteful.

[Figure: Spark's unified memory model — JVM heap regions, execution vs. storage memory with dynamic borrowing, off-heap Tungsten memory, driver vs. executor layout, and container memory formulas]

Spark 1.6 introduced the Unified Memory Manager (SPARK-10000), which replaced the static model with a dynamic one. Execution and storage share a single pool and can borrow from each other.

The Four Memory Regions

Every Spark executor's JVM heap is divided into four regions:

┌─────────────────────────────────────────────────────┐
│ JVM Heap (spark.executor.memory) │
│ │
│ ┌───────────────┐ │
│ │ Reserved │ 300 MB (hardcoded) │
│ │ Memory │ Internal Spark structures │
│ └───────────────┘ │
│ │
│ ┌───────────────┐ │
│ │ User │ (heap - 300MB) × (1 - 0.6) │
│ │ Memory │ Your objects, UDFs, metadata │
│ └───────────────┘ │
│ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Unified Memory (heap - 300MB) × 0.6 │ │
│ │ │ │
│ │ ┌───────────────────┐ ┌───────────────────┐ │ │
│ │ │ Storage Memory │ │ Execution Memory │ │ │
│ │ │ (initial 50%) │ │ (initial 50%) │ │ │
│ │ │ │ │ │ │ │
│ │ │ Cached DFs/RDDs │ │ Shuffle buffers │ │ │
│ │ │ Broadcast vars │ │ Sort buffers │ │ │
│ │ │ Task results │ │ Hash tables │ │ │
│ │ │ │ │ Aggregation maps │ │ │
│ │ └───────────────────┘ └───────────────────┘ │ │
│ │ ← dynamic boundary → │ │
│ └─────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────┘

Region 1: Reserved Memory (300 MB)

A hardcoded 300 MB set aside for Spark's internal structures — logging, metrics, internal bookkeeping. This is not configurable in production. There is a spark.testing.reservedMemory parameter but it is explicitly for testing only and should never be changed.

Implication: Your executor must have at least 300 MB + some usable space. In practice, never set spark.executor.memory below 1 GB.

Region 2: User Memory

User Memory = (spark.executor.memory - 300 MB) × (1 - spark.memory.fraction)

With default spark.memory.fraction=0.6:

User Memory = (spark.executor.memory - 300 MB) × 0.4

User memory holds:

  • Your custom data structures (HashMap, ArrayList, etc.)
  • Internal metadata for data processing
  • UDF state and closures
  • RDD dependency information
  • Spark internal objects not managed by the memory manager

This region is not managed by Spark's unified memory manager. If you allocate too many objects here, you get a standard java.lang.OutOfMemoryError: Java heap space with no help from Spark's eviction or spill mechanisms.

Region 3: Unified Memory — Storage

Protected Storage = (spark.executor.memory - 300 MB) × spark.memory.fraction × spark.memory.storageFraction

With defaults:

Protected Storage = (spark.executor.memory - 300 MB) × 0.6 × 0.5
= (spark.executor.memory - 300 MB) × 0.3

Storage memory holds:

  • Cached DataFrames and RDDs — data stored via .cache() or .persist()
  • Broadcast variables — hash tables and data sent to all executors
  • Unroll memory — space used when deserializing blocks for caching

The protected storage portion (defined by spark.memory.storageFraction) is immune to eviction by execution. Storage above this protected floor can be evicted when execution needs space.

Region 4: Unified Memory — Execution

Initial Execution = (spark.executor.memory - 300 MB) × spark.memory.fraction × (1 - spark.memory.storageFraction)

With defaults:

Initial Execution = (spark.executor.memory - 300 MB) × 0.6 × 0.5
= (spark.executor.memory - 300 MB) × 0.3

Execution memory holds:

  • Shuffle buffers — data being serialized and written during shuffle
  • Sort buffers — UnsafeExternalSorter pages for SortMergeJoin and sort operations
  • Hash tables — HashedRelation for BroadcastHashJoin and ShuffledHashJoin
  • Aggregation maps — hash maps for GROUP BY operations

Execution memory has priority over storage. If execution needs more space and storage is occupying it, storage blocks are evicted (using LRU). The reverse is not true — space that execution has borrowed from the storage region is never forcibly reclaimed.

How Dynamic Memory Sharing Works

The unified memory model's key feature is that the boundary between execution and storage is soft. Both regions can expand into the other's space when it is idle.

Borrowing Rules

Storage borrows from execution: When storage needs more memory and execution has unused capacity, storage expands into execution's region. However, if execution later needs that memory back, storage must give it up — cached blocks are evicted using LRU (Least Recently Used).

Execution borrows from storage: When execution needs more memory and storage has unused capacity, execution expands into storage's region. If storage later needs memory, execution does not give it back. Storage must wait for execution to release naturally or spill cached data to disk.

Why Execution Has Priority

Execution memory holds intermediate computation state — shuffle buffers, sort arrays, hash tables. If execution runs out of memory mid-operation, the task fails with SparkOutOfMemoryError. Recovering from this requires restarting the entire task.

Storage memory holds cached data. If cached data is evicted, it can be recomputed from the lineage (RDD) or re-read from disk (MEMORY_AND_DISK). Eviction is inconvenient but not fatal.

This asymmetry is intentional: losing execution state is worse than losing cached data.
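The asymmetry can be illustrated with a toy model. This is a sketch, not Spark's real UnifiedMemoryManager (the protected storage floor and per-task accounting are deliberately omitted); it encodes only the two borrowing rules above:

```python
class ToyUnifiedPool:
    """Toy model of the execution/storage borrowing rules described above."""

    def __init__(self, total):
        self.total = total
        self.execution = 0  # held by shuffle/sort/hash state
        self.storage = 0    # held by cached blocks

    def acquire_execution(self, n):
        free = self.total - self.execution - self.storage
        if n > free:
            # Execution may evict cached blocks to cover the shortfall.
            self.storage -= min(self.storage, n - free)
        if n <= self.total - self.execution - self.storage:
            self.execution += n
            return True
        return False  # still short: the task would spill or fail

    def acquire_storage(self, n):
        # Storage may only use idle space; it can never evict execution.
        if n <= self.total - self.execution - self.storage:
            self.storage += n
            return True
        return False

pool = ToyUnifiedPool(total=100)
pool.acquire_storage(80)      # cache fills most of the pool
pool.acquire_execution(50)    # evicts 30 units of cache, then succeeds
print(pool.execution, pool.storage)  # 50 50
```

Once execution holds those 50 units, a further storage request fails until execution releases memory naturally — exactly the one-way street described above.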

The Protected Storage Floor

The spark.memory.storageFraction (default 0.5) defines the protected floor below which execution cannot evict storage. Even under extreme execution pressure, Spark guarantees that storage retains at least this fraction of the unified region.

Protected Floor = Unified Memory × spark.memory.storageFraction
= (spark.executor.memory - 300 MB) × 0.6 × 0.5

Anything cached above this floor is fair game for eviction. Anything below it is safe.
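To see how much of a given cache is exposed, the floor can be computed directly. A minimal sketch in plain Python, assuming the default fractions and the hardcoded 300 MB reserve:

```python
def evictable_storage_mb(executor_memory_mb, cached_mb,
                         memory_fraction=0.6, storage_fraction=0.5):
    """Cached data above the protected floor that execution could evict."""
    unified = (executor_memory_mb - 300) * memory_fraction
    floor = unified * storage_fraction
    return max(cached_mb - floor, 0.0)

# 16 GB executor with 7 GB cached: the floor is ~4,825 MB,
# so roughly 2,343 MB of cache is fair game for eviction.
print(round(evictable_storage_mb(16 * 1024, 7 * 1024)))
```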

Exact Memory Calculations

Example: 16 GB Executor

spark.executor.memory = 16 GB (16,384 MB)
spark.memory.fraction = 0.6
spark.memory.storageFraction = 0.5
| Region | Formula | Size |
|---|---|---|
| Reserved | Hardcoded | 300 MB |
| Usable Heap | 16,384 - 300 | 16,084 MB |
| User Memory | 16,084 × 0.4 | 6,434 MB |
| Unified Memory | 16,084 × 0.6 | 9,650 MB |
| → Storage (initial/protected) | 9,650 × 0.5 | 4,825 MB |
| → Execution (initial) | 9,650 × 0.5 | 4,825 MB |

Key insight: Of your 16 GB executor, Spark only has 9.65 GB of managed unified memory. The rest is reserved (300 MB) and user space (6.4 GB). When you see "executor memory" in configurations, the usable Spark-managed portion is only 60% of the heap minus 300 MB.
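The region tables in these examples can be reproduced with a short helper. A minimal sketch in plain Python (not a Spark API; the 300 MB constant and default fractions come from the formulas above):

```python
RESERVED_MB = 300  # hardcoded reserve in Spark's unified memory model

def memory_regions(executor_memory_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Size of each heap region in MB, per the formulas above."""
    usable = executor_memory_mb - RESERVED_MB
    unified = usable * memory_fraction
    return {
        "reserved": RESERVED_MB,
        "user": usable * (1 - memory_fraction),
        "unified": unified,
        "storage_protected": unified * storage_fraction,
        "execution_initial": unified * (1 - storage_fraction),
    }

for name, mb in memory_regions(16 * 1024).items():  # 16 GB executor
    print(f"{name:18s} {mb:8.0f} MB")
```

Running it for 16 GB reproduces the table above: 6,434 MB of user memory, 9,650 MB unified, split 4,825/4,825.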

Example: 4 GB Executor

spark.executor.memory = 4 GB (4,096 MB)
| Region | Size |
|---|---|
| Reserved | 300 MB |
| User Memory | (4,096 - 300) × 0.4 = 1,518 MB |
| Unified Memory | (4,096 - 300) × 0.6 = 2,278 MB |
| → Storage (protected) | 1,139 MB |
| → Execution (initial) | 1,139 MB |

Only 2.28 GB of managed memory from a 4 GB executor. This is why small executors struggle with large shuffles and caching.

Example: 32 GB Executor

spark.executor.memory = 32 GB (32,768 MB)
| Region | Size |
|---|---|
| Reserved | 300 MB |
| User Memory | (32,768 - 300) × 0.4 = 12,987 MB |
| Unified Memory | (32,768 - 300) × 0.6 = 19,481 MB |
| → Storage (protected) | 9,740 MB |
| → Execution (initial) | 9,740 MB |

19.5 GB of managed memory. This is where broadcast joins and large caches start to become comfortable.

Driver Memory Architecture

The driver is fundamentally different from executors. It coordinates the application but does not process data partitions. Its memory layout serves a different purpose.

What the Driver Holds

  • SparkContext — the entry point object for all Spark operations
  • DAGScheduler and TaskScheduler — the scheduling infrastructure for all stages and tasks
  • Broadcast variables — the driver builds and holds the full copy before broadcasting
  • Collect results — any data pulled back via .collect(), .toPandas(), .take(), .show()
  • Accumulator values — aggregated from all executors
  • Query plans — for complex queries, the catalyst plan tree itself can be large

Driver Memory Configuration

spark.driver.memory = 1g (default)

The driver's JVM heap. Controls how much data the driver can hold in memory.

spark.driver.memoryOverhead = max(driverMemory × 0.10, 384 MB)

Off-heap memory for the driver process — NIO direct buffers, native memory, JVM internal overhead.

spark.driver.maxResultSize = 1g (default)

Maximum total size of serialized results from all partitions for a single action (like collect()). If a result exceeds this, the job is aborted. This is a safety net — not the memory allocation. Setting this to 0 disables the limit, which is dangerous.

Common Driver OOM Scenarios

1. Large collect():

# This pulls ALL data to the driver — OOM if df is large
all_data = df.collect()

# This too — the entire DataFrame becomes a Pandas DataFrame in driver memory
pandas_df = df.toPandas()

2. Large broadcast:

The driver must hold the entire broadcast table in memory while building the HashedRelation. A 500 MB Parquet table can expand to 2+ GB in memory.

3. Too many accumulator values:

Each task sends accumulator updates back to the driver. With millions of tasks and custom accumulators, the driver accumulates significant state.

4. Complex query plans:

Queries with hundreds of columns, deep subquery nesting, or many unions can create large catalyst plan trees that consume driver heap.

Right-Sizing the Driver

For most workloads:

spark.driver.memory = 2g to 8g

If your workflow includes collect(), toPandas(), or broadcast of tables > 100 MB:

spark.driver.memory = 8g to 16g

If you do not use any of these and the driver is purely a coordinator:

spark.driver.memory = 2g to 4g

Executor Memory Architecture

Each executor runs tasks that process data partitions. Executor memory is where the actual computation happens.

The Full Executor Memory Layout

┌─────────────────────────────────────────────────────────┐
│ Container Memory │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ JVM Heap (spark.executor.memory) │ │
│ │ │ │
│ │ Reserved (300 MB) │ │
│ │ User Memory (heap - 300) × 0.4 │ │
│ │ Unified Memory (heap - 300) × 0.6 │ │
│ │ - Storage (cached data, broadcasts) │ │
│ │ - Execution (shuffles, sorts, joins) │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Memory Overhead (spark.executor.memoryOverhead)│ │
│ │ │ │
│ │ NIO Direct Buffers (Netty) │ │
│ │ Thread Stacks │ │
│ │ JVM Internal Structures │ │
│ │ Native Memory (JNI) │ │
│ │ Compressed Class Space / Metaspace │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Off-Heap Memory (spark.memory.offHeap.size) │ │
│ │ (only if spark.memory.offHeap.enabled=true) │ │
│ │ │ │
│ │ Tungsten managed memory │ │
│ │ Execution: shuffles, sorts, hash tables │ │
│ │ Storage: off-heap columnar cache │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ PySpark Memory (spark.executor.pyspark.memory)│ │
│ │ (only for PySpark workloads) │ │
│ │ │ │
│ │ Python worker process memory │ │
│ │ Pandas DataFrames, NumPy arrays │ │
│ │ Arrow serialization buffers │ │
│ └──────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘

Container Memory Formula

Total Container Memory = spark.executor.memory
+ spark.executor.memoryOverhead
+ spark.memory.offHeap.size (if enabled)
+ spark.executor.pyspark.memory (if set)

This total is what YARN or Kubernetes allocates for the container/pod. If the process exceeds this, the container is killed.
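When sizing clusters, the formula is easy to script. A sketch in plain Python (the 0.10 factor and 384 MB minimum are the platform defaults discussed in this post; adjust them for your deployment):

```python
def container_memory_mb(executor_memory_mb, overhead_factor=0.10,
                        offheap_mb=0, pyspark_mb=0, min_overhead_mb=384):
    """Total container/pod request, per the formula above."""
    overhead = max(executor_memory_mb * overhead_factor, min_overhead_mb)
    return executor_memory_mb + overhead + offheap_mb + pyspark_mb

# 20 GB executor on YARN (0.07 factor): 20,480 + 1,433.6 ≈ 21,913.6 MB (~21.4 GB)
print(container_memory_mb(20 * 1024, overhead_factor=0.07))

# 16 GB PySpark executor on K8s with a 4 GB Python worker:
# 16,384 + 1,638.4 + 4,096 ≈ 22,118.4 MB (~21.6 GB)
print(container_memory_mb(16 * 1024, pyspark_mb=4 * 1024))
```

These two calls reproduce the YARN and Kubernetes examples worked through later in this section.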

Memory Overhead Defaults

The overhead default depends on the deployment platform:

| Platform | Default Formula | Minimum |
|---|---|---|
| YARN | executorMemory × 0.07 | 384 MB |
| Kubernetes (JVM) | executorMemory × 0.10 | 384 MB |
| Kubernetes (non-JVM) | executorMemory × 0.40 | 384 MB |

What needs overhead memory:

  • Netty NIO buffers — shuffle data transfer uses direct byte buffers outside JVM heap
  • Thread stacks — each thread needs 512 KB–1 MB of native stack space
  • JVM internal structures — class metadata (Metaspace), JIT compiled code, internal tables
  • JNI native memory — any native libraries loaded by Spark or your code
  • Compressed oops — JVM pointer compression overhead

Per-Task Memory

Each executor runs spark.executor.cores (default 1) tasks concurrently. All of those tasks share the executor's unified memory pool, mediated by TaskMemoryManager: with N concurrent tasks, each task is guaranteed at least 1/(2N) and at most 1/N of the pool before it must spill.

Available memory per task ≈ Unified Memory / spark.executor.cores

With a 16 GB executor and 4 cores:

Per-task unified memory ≈ 9,650 MB / 4 = 2,412 MB

This is why increasing cores without increasing memory can cause OOM — each task gets less memory.
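The per-task arithmetic can be sketched as follows; the 1/(2N) lower bound is the fairness guarantee Spark's execution memory pool enforces before forcing a task to spill:

```python
def per_task_memory_mb(executor_memory_mb, cores,
                       memory_fraction=0.6, reserved_mb=300):
    """Approximate unified-memory share per concurrent task."""
    unified = (executor_memory_mb - reserved_mb) * memory_fraction
    return {
        "fair_share": unified / cores,           # 1/N of the pool
        "guaranteed_min": unified / (2 * cores), # 1/(2N) floor before spilling
    }

shares = per_task_memory_mb(16 * 1024, cores=4)
print(shares)  # fair_share ≈ 2,412 MB, guaranteed_min ≈ 1,206 MB
```

Doubling cores to 8 on the same executor halves both numbers, which is exactly the OOM trap described above.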

Off-Heap Memory

Off-heap memory stores data outside the JVM heap, in native memory managed by Spark's Tungsten engine. The primary benefit: zero GC overhead. Data in off-heap is invisible to the garbage collector.

Configuration

spark.memory.offHeap.enabled = false (default)
spark.memory.offHeap.size = 0 (default, in bytes)

When enabled, Spark creates a second unified memory pool in off-heap space, with the same execution/storage split. The on-heap unified pool still exists — both pools are used.

How Tungsten Uses Off-Heap

Tungsten is Spark's memory management engine that operates on raw binary data:

  • Uses sun.misc.Unsafe (or java.lang.foreign.MemorySegment in newer JDKs) for direct memory access
  • Allocates memory in pages (default 16 MB page size)
  • Stores data in compact binary format — no Java object headers, no boxing
  • Enables cache-friendly data layouts for better CPU performance

Operations that benefit from off-heap:

  • Shuffle sort and merge
  • Hash table construction for joins and aggregations
  • Columnar cache storage (spark.sql.columnVector.offheap.enabled)

When to Use Off-Heap

Use off-heap when:

  • GC pauses are causing performance problems (> 10% of task time)
  • You have large shuffles or hash tables that create GC pressure
  • You need predictable latency without GC spikes

Avoid off-heap when:

  • Your workload is not GC-bound
  • You cannot increase container memory limits (off-heap adds to total memory)
  • You are debugging memory issues (off-heap is harder to inspect)

Container Memory Impact

Off-heap memory is not inside the JVM heap. It must be accounted for in the container's total memory:

Container Memory = spark.executor.memory + memoryOverhead + spark.memory.offHeap.size

If you enable 4 GB of off-heap but do not increase the container limit, the process will be killed for exceeding memory.

Example configuration:

spark.executor.memory = 12g
spark.executor.memoryOverhead = 2g
spark.memory.offHeap.enabled = true
spark.memory.offHeap.size = 4g

Total container memory: 12 + 2 + 4 = 18 GB.

Container Memory: YARN and Kubernetes

Understanding container memory is critical because the most confusing OOM errors come from the container being killed — not from the JVM running out of heap.

YARN Container Memory

YARN Container Memory = spark.executor.memory
+ spark.yarn.executor.memoryOverhead
+ spark.memory.offHeap.size
+ spark.executor.pyspark.memory

YARN overhead default:

spark.yarn.executor.memoryOverhead = max(spark.executor.memory × 0.07, 384 MB)

Example: 20 GB executor on YARN:

JVM Heap:   20 GB
Overhead: max(20 × 0.07, 0.384) = 1.4 GB
Container: 21.4 GB

The YARN NodeManager monitors the container's physical memory usage (RSS). If it exceeds the container limit, YARN kills the container immediately with no Java stack trace — just a log message: "Container killed by YARN for exceeding memory limits." The exit code is 137.

Kubernetes Pod Memory

K8s Pod Memory = spark.executor.memory
+ spark.kubernetes.executor.memoryOverhead
+ spark.memory.offHeap.size
+ spark.executor.pyspark.memory

Kubernetes overhead default:

spark.kubernetes.memoryOverheadFactor = 0.10 (JVM) or 0.40 (non-JVM like PySpark)
spark.kubernetes.executor.memoryOverhead = max(spark.executor.memory × factor, 384 MB)

Example: 16 GB executor on Kubernetes (PySpark):

JVM Heap:    16 GB
Overhead: max(16 × 0.10, 0.384) = 1.6 GB
PySpark: 4 GB
Pod Memory: 21.6 GB

When a pod exceeds its memory limit, Kubernetes sends OOMKilled and the pod is terminated. You see this in kubectl describe pod output.

Why Container Kills Are Confusing

Container kills produce no Java stack trace. The process is killed by the operating system or container runtime, not by the JVM. This makes them the hardest OOM errors to diagnose because:

  1. No OutOfMemoryError in application logs
  2. The Spark UI may not capture the failure
  3. The error message only says "exceeded memory limits" without telling you what consumed the memory
  4. The root cause is usually off-heap memory (NIO buffers, native memory) that is not visible in JVM heap metrics

Right-Sizing Overhead

| Workload Type | Recommended Overhead |
|---|---|
| Scala/Java, light shuffle | Default (7-10%) |
| Heavy shuffle, large broadcast | 15-20% |
| PySpark with UDFs | 20-30% |
| PySpark with pandas_udf + Arrow | 25-40% |
| Off-heap enabled | Add full offHeap.size separately |

Rule of thumb: Start with default overhead. If you see container kills but no JVM OOM in logs, increase overhead by 50% and monitor again.
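One way to operationalize the table is to encode the factors and compute overhead per workload type. A sketch under stated assumptions (the factor values are the upper ends of the recommended ranges above, not an official Spark API):

```python
OVERHEAD_FACTOR = {            # upper end of each recommended range above
    "jvm_light": 0.10,
    "heavy_shuffle": 0.20,
    "pyspark_udf": 0.30,
    "pyspark_arrow": 0.40,
}

def overhead_mb(executor_memory_mb, workload="jvm_light"):
    """Suggested spark.executor.memoryOverhead in MB (384 MB minimum)."""
    return max(executor_memory_mb * OVERHEAD_FACTOR[workload], 384)

print(overhead_mb(16 * 1024, "pyspark_arrow"))  # 6553.6
```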

PySpark Memory Architecture

PySpark adds a separate Python process alongside the JVM executor, creating a dual-memory architecture.

How PySpark Memory Works

┌─────────────────────────────────────┐
│ Container │
│ │
│ ┌─────────────┐ ┌──────────────┐ │
│ │ JVM Executor │ │ Python Worker│ │
│ │ │ │ │ │
│ │ Spark Core │←→│ PySpark API │ │
│ │ Catalyst │ │ Pandas UDFs │ │
│ │ Tungsten │ │ NumPy/SciPy │ │
│ │ │ │ │ │
│ │ executor. │ │ pyspark. │ │
│ │ memory │ │ memory │ │
│ └─────────────┘ └──────────────┘ │
│ │
│ + memoryOverhead (shared) │
└─────────────────────────────────────┘

Key PySpark Memory Configurations

spark.executor.pyspark.memory    -- not set by default

Limits the Python worker process memory. When set, Spark configures RLIMIT_AS to restrict the Python process. Without this, the Python worker can consume unbounded memory.

spark.python.worker.memory       -- not set by default

Controls when Spark spills data to disk during Python-side aggregations in the RDD API (operations that run through the Py4J bridge).

spark.sql.execution.arrow.pyspark.enabled = false (default)

Enables Apache Arrow for efficient data transfer between JVM and Python. With Arrow, data is transferred as columnar batches instead of row-by-row serialization — 10-100x faster.

spark.sql.execution.arrow.maxRecordsPerBatch = 10000

Maximum rows per Arrow record batch. Larger batches are more efficient but use more memory per batch.
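To reason about per-batch memory, multiply the batch size by an estimated row width. A rough sketch (bytes_per_row is your own estimate from your schema, not something Spark reports):

```python
def arrow_batch_mb(rows_per_batch, bytes_per_row):
    """Rough size of one Arrow batch crossing the JVM/Python boundary."""
    return rows_per_batch * bytes_per_row / (1024 * 1024)

# Default 10,000-row batches of ~200-byte rows: about 1.9 MB per batch,
# and a transfer may briefly hold a copy on each side of the boundary.
print(round(arrow_batch_mb(10_000, 200), 1))  # 1.9
```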

PySpark Memory Traps

1. toPandas() pulls everything to driver:

# This collects ALL data to the driver, then converts to Pandas
# Both the Spark collect AND the Pandas DataFrame must fit in driver memory
pandas_df = spark_df.toPandas()

# For a 10 GB DataFrame, you need ~20+ GB of driver memory

2. pandas_udf materializes whole Arrow batches — and grouped operations, whole groups:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def my_func(v: pd.Series) -> pd.Series:
    return v * 2  # each Arrow batch arrives as an in-memory Pandas Series

A scalar pandas UDF receives data in Arrow batches capped by spark.sql.execution.arrow.maxRecordsPerBatch, so its peak memory is bounded. Grouped-map operations (groupBy().applyInPandas()) have no such cap: the entire group is materialized as a single Pandas DataFrame in the Python worker. If one group holds 50 million rows, all of it lands in Python memory at once — use finer-grained grouping keys or pre-aggregate in Spark to keep groups small.

3. Arrow serialization overhead:

Arrow transfers between JVM and Python create temporary copies. With large batches, this can double the effective memory usage temporarily.

spark.executor.memory = 8g
spark.executor.memoryOverhead = 2g
spark.executor.pyspark.memory = 4g
spark.sql.execution.arrow.pyspark.enabled = true
spark.sql.execution.arrow.maxRecordsPerBatch = 10000

Total container: 8 + 2 + 4 = 14 GB.

Memory Configuration Reference

Core Memory

| Configuration | Default | Description |
|---|---|---|
| spark.executor.memory | 1g | JVM heap per executor |
| spark.driver.memory | 1g | JVM heap for driver |
| spark.driver.maxResultSize | 1g | Max serialized result size per action |
| spark.executor.cores | 1 | Concurrent tasks per executor (affects per-task memory) |

Unified Memory Model

| Configuration | Default | Description |
|---|---|---|
| spark.memory.fraction | 0.6 | Fraction of (heap - 300 MB) for unified pool |
| spark.memory.storageFraction | 0.5 | Protected storage floor within unified pool |

Off-Heap

| Configuration | Default | Description |
|---|---|---|
| spark.memory.offHeap.enabled | false | Enable off-heap memory |
| spark.memory.offHeap.size | 0 | Off-heap allocation in bytes |

Memory Overhead

| Configuration | Default | Description |
|---|---|---|
| spark.executor.memoryOverhead | max(memory × 0.10, 384 MB) | Executor off-heap overhead |
| spark.driver.memoryOverhead | max(memory × 0.10, 384 MB) | Driver off-heap overhead |
| spark.yarn.executor.memoryOverhead | max(memory × 0.07, 384 MB) | YARN-specific override |
| spark.kubernetes.memoryOverheadFactor | 0.10 (JVM), 0.40 (non-JVM) | K8s overhead factor |

PySpark

| Configuration | Default | Description |
|---|---|---|
| spark.executor.pyspark.memory | not set | Python worker memory limit |
| spark.python.worker.memory | not set | Py4J spill threshold |
| spark.sql.execution.arrow.pyspark.enabled | false | Enable Arrow for PySpark |
| spark.sql.execution.arrow.maxRecordsPerBatch | 10000 | Rows per Arrow batch |

Serialization (Affects Memory Usage)

| Configuration | Default | Description |
|---|---|---|
| spark.serializer | JavaSerializer | Serializer class (Kryo is 2-10x more compact) |
| spark.kryoserializer.buffer | 64k | Initial Kryo buffer |
| spark.kryoserializer.buffer.max | 64m | Max Kryo buffer |

Shuffle and Execution (Affects Memory Usage)

| Configuration | Default | Description |
|---|---|---|
| spark.sql.shuffle.partitions | 200 | Partitions per shuffle (more = less data per task) |
| spark.shuffle.file.buffer | 32k | Shuffle file output buffer |
| spark.reducer.maxSizeInFlight | 48m | Max shuffle data in flight per reduce task |
| spark.sql.files.maxPartitionBytes | 128 MB | Max bytes per partition when reading files |

How Cazpian Uses Memory Architecture

Cazpian's compute pools are configured with production-tested memory defaults for Iceberg workloads on S3:

  • Unified memory sizing is tuned for the mix of execution (shuffle-heavy joins) and storage (cached dimension tables) typical in lakehouse queries
  • Container memory accounts for off-heap overhead from Netty shuffle and Arrow serialization
  • PySpark memory is pre-configured when Python runtimes are selected
  • Per-query memory tracking in the SQL editor shows exactly how much memory each query consumed — unified, execution, storage, spill — so teams can right-size without guesswork

The companion post — Spark OOM Debugging: The Complete Guide — covers how to diagnose and fix every type of OOM error using Spark UI, GC analysis, and observability tools.

Summary

Spark's unified memory model is elegant but complex. The key takeaways:

  1. Only 60% of your heap is managed by Spark. The rest is reserved (300 MB) and user memory. A 16 GB executor has ~9.65 GB of unified memory
  2. Execution has priority over storage. Execution can evict cached data, but storage cannot evict execution buffers. This is by design — computation failures are more expensive than cache misses
  3. The boundary between execution and storage is dynamic. They borrow from each other, but only storage can be forcibly evicted
  4. Container memory = heap + overhead + off-heap + PySpark. Container kills (exit 137, OOMKilled) happen when the total exceeds the container limit — not when the JVM heap is full
  5. Driver and executor OOM are completely different problems. Driver OOM is usually caused by collect(), broadcast, or accumulators. Executor OOM is caused by large shuffles, skewed partitions, or insufficient memory per task
  6. PySpark doubles the memory challenge. A separate Python process runs alongside the JVM, and both need memory within the same container

The next time you see an OOM error, do not double the memory. Calculate which region overflowed, understand what filled it, and fix the root cause.