Zero Cold Starts: How Cazpian Compute Pools Cut Your Spark Bills in Half

· 11 min read
Cazpian Engineering
Platform Engineering Team

In Part 1 of this series, we exposed the Small Job Tax — the hidden cost of cold starts, overprovisioned clusters, and per-job infrastructure overhead that silently drains data budgets. We showed that for many teams, more than half of their Spark compute spend goes to infrastructure bootstrapping, not actual data processing.

The natural follow-up question: what if you could eliminate that overhead entirely?

That is exactly what Cazpian Compute Pools are built to do.

The Core Problem: Every Job Pays an Entry Fee

In a traditional Spark deployment — whether on Databricks job clusters, EMR, or self-managed Kubernetes — every job submission triggers the same sequence:

  1. Request compute resources from the cloud provider
  2. Wait for VMs to provision, images to pull, volumes to attach
  3. Bootstrap the Spark runtime (JVM, SparkContext, executor registration)
  4. Run the actual job
  5. Tear down everything

Steps 1-3 take 2-5 minutes. For a small job that runs in 60 seconds, you are paying for 3-6 minutes of infrastructure to get 1 minute of work. This is the entry fee, and every single job pays it — whether it processes 500 MB or 500 GB.
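The entry-fee arithmetic is worth making concrete. This short sketch (using the hypothetical figures from the example above — a 4-minute cold start in front of a 1-minute job) computes what fraction of the bill is pure overhead:

```python
# Illustrative sketch of the "entry fee": billed minutes vs. useful minutes.
# The 4-minute provisioning and 1-minute job are the example figures above.

def entry_fee_ratio(provision_min: float, work_min: float) -> float:
    """Fraction of billed time spent on infrastructure, not data processing."""
    billed = provision_min + work_min
    return provision_min / billed

# A 60-second job behind a 4-minute cold start: 80% of the bill is overhead.
overhead = entry_fee_ratio(provision_min=4.0, work_min=1.0)
print(f"{overhead:.0%} of billed time is overhead")  # → 80%
```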

The industry has tried to solve this with faster provisioning, pre-warmed instance pools, and serverless options that reduce startup to 15-30 seconds. These are improvements. But they still treat each job as an isolated event that needs its own Spark runtime, its own driver, and its own set of executors.

Cazpian takes a fundamentally different approach.

How Cazpian Compute Pools Work

Instead of spinning up and tearing down compute for every job, Cazpian maintains persistent, warm compute pools — shared Spark environments that are always running and ready to accept work.

The Architecture

A Cazpian Compute Pool consists of three layers:

The Warm Driver Layer. A persistent Spark driver process runs continuously within the pool. It maintains an active SparkContext, pre-loaded configurations, and live connections to your Iceberg catalog. When a job arrives, there is no JVM to start, no context to initialize, and no catalog to discover. The driver is already there, already connected, already warm.

The Elastic Executor Layer. A small set of executors stays warm at all times — enough to handle typical small-job workloads immediately. When demand spikes or a larger job arrives, additional executors scale up within seconds (not minutes, because the underlying compute nodes are pre-provisioned in the pool). When demand drops, excess executors scale back down. You pay for what you use, but you never wait for what you need.

The Intelligent Router. Not every job belongs in a pool. Cazpian's routing layer evaluates each incoming job before execution and makes a placement decision:

  • Input size under 10 GB, no heavy shuffles, broadcastable joins — route to a Compute Pool
  • Large or complex workloads — route to dedicated, right-sized compute
  • Safety valves — if a pooled job exceeds memory or runtime thresholds, it gets automatically requeued to dedicated compute

This routing happens transparently. Your pipelines do not change. Your code does not change. The platform makes the right decision for each job, every time.
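The routing rules above can be sketched as a simple decision function. The 10 GB input threshold, the shuffle check, and the broadcastability check come from the criteria listed; the function name and job-profile fields are illustrative placeholders, not Cazpian's actual API:

```python
# Hypothetical sketch of the routing decision described above.
# Thresholds come from the text; names are illustrative, not Cazpian's API.

from dataclasses import dataclass

GB = 1024 ** 3

@dataclass
class JobProfile:
    input_bytes: int
    has_heavy_shuffle: bool
    joins_broadcastable: bool

def route(job: JobProfile) -> str:
    """Return 'pool' for small, pool-friendly jobs, else 'dedicated'."""
    if (job.input_bytes < 10 * GB
            and not job.has_heavy_shuffle
            and job.joins_broadcastable):
        return "pool"
    return "dedicated"

print(route(JobProfile(2 * GB, False, True)))    # small ETL job → pool
print(route(JobProfile(500 * GB, True, False)))  # large batch → dedicated
```

The safety valves described in the third bullet act as a runtime backstop for jobs this static check misclassifies.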

What This Means in Practice

When a small ETL job hits a Cazpian Compute Pool:

```
Traditional:
|-- Provisioning (2-5 min) --|-- Bootstrap (15-45s) --|-- Work (60s) --|
Total: 3-6 minutes billed

Cazpian Compute Pool:
|-- Work (60s) --|
Total: 60 seconds billed
```

The cold start is not reduced. It is eliminated. Your job starts executing the moment it is submitted.

The Economics: Before and After

Let us revisit the scenarios from Part 1, now with Cazpian Compute Pools in the picture.

Scenario 1: 200 Small Jobs Per Day

Before (job clusters):

| Component | Calculation | Daily Cost |
| --- | --- | --- |
| Cold-start overhead | 200 jobs × 4 min × $0.20/min | $160.00 |
| Actual compute | 200 jobs × 3 min × $0.20/min | $120.00 |
| **Total** | | **$280.00** |

After (Cazpian Compute Pools):

| Component | Calculation | Daily Cost |
| --- | --- | --- |
| Cold-start overhead | Eliminated | $0.00 |
| Actual compute | 200 jobs × 3 min × $0.15/min | $90.00 |
| Pool warm cost | Shared driver + base executors, 24 hr | $12.00 |
| **Total** | | **$102.00** |

Savings: $178/day — 63.5% reduction. Annualized: $64,970.

The per-minute compute rate drops because pooled executors are right-sized (small instances, not overprovisioned job-cluster templates). The pool warm cost is fixed but amortized across all 200 jobs — just $0.06 per job for always-on readiness.
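The arithmetic behind the tables is easy to verify. This sketch reproduces the Scenario 1 numbers, using the per-minute rates implied by the table totals ($0.20/min on job clusters, $0.15/min for pooled executors):

```python
# Reproducing the Scenario 1 arithmetic from the tables above.
# Rates are the ones implied by the totals: $0.20/min before, $0.15/min pooled.

jobs = 200
before = jobs * 4 * 0.20 + jobs * 3 * 0.20   # cold-start overhead + compute
after = jobs * 3 * 0.15 + 12.00              # pooled compute + pool warm cost

savings = before - after
print(f"Before: ${before:.2f}/day  After: ${after:.2f}/day")
print(f"Savings: ${savings:.2f}/day ({savings / before:.1%} reduction)")
print(f"Annualized: ${savings * 365:,.0f}")
```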

Scenario 2: 500 Small Jobs Across Multiple Teams

Before (EKS per-job pods):

| Component | Calculation | Daily Cost |
| --- | --- | --- |
| Scheduling + bootstrap overhead | 500 jobs × 2.5 min × $0.006/min | $7.50 |
| Actual compute | 500 jobs × 2 min × $0.006/min | $6.00 |
| Storage churn (EBS/PVC) | 500 volumes created/destroyed | $4.50 |
| **Total** | | **$18.00** |

After (Cazpian Compute Pools):

| Component | Calculation | Daily Cost |
| --- | --- | --- |
| Scheduling + bootstrap overhead | Eliminated | $0.00 |
| Actual compute | 500 jobs × 2 min × $0.004/min | $4.00 |
| Storage churn | Eliminated (persistent pool) | $0.00 |
| Pool warm cost | Shared across teams | $3.20 |
| **Total** | | **$7.20** |

Savings: $10.80/day — 60% reduction. At enterprise scale with thousands of daily jobs, this compounds to six-figure annual savings.

Why Not Just Use Serverless Spark?

A fair question. Databricks Serverless and EMR Serverless both reduce cold starts significantly — from minutes down to 15-30 seconds. So why go further?

Because 15-30 seconds still adds up at scale. If you run 500 jobs a day and each one has a 20-second serverless startup, that is nearly 2.8 hours of daily overhead you are still paying for. Cazpian Compute Pools bring this to near zero.

Because serverless still provisions per-job. Each serverless job gets its own isolated Spark runtime. That means each job pays for JVM initialization, context creation, and catalog connection — even if the previous job connected to the same catalog 10 seconds ago. Cazpian pools reuse the warm context across jobs.

Because serverless pricing is opaque. Serverless platforms charge premium rates for the convenience of not managing infrastructure. Cazpian's usage-based billing gives you the same convenience with transparent, predictable pricing — you see exactly what each job costs in real time.

Because serverless does not solve file hygiene. A serverless job still produces whatever output files your code writes. If your small job writes 50 tiny Parquet files, serverless will not help. Cazpian Compute Pools include built-in write coalescing and scheduled compaction, so your Iceberg tables stay performant without manual intervention.

Pool Sizing: Right-Sized by Default

One of the most common mistakes in self-managed Spark environments is using a single cluster template for all workloads. A job processing 500 MB runs on the same hardware configuration as one processing 500 GB.

Cazpian Compute Pools come in preset tiers that match real-world workload patterns:

Small Pool (Default for Jobs Under 5 GB)

  • Driver: 1 vCPU, 1 GB heap + 384 MB overhead
  • Executors: 2-4 instances, each with 1 vCPU and 1-2 GB heap
  • Best for: Lightweight transforms, CSV/JSON to Iceberg conversions, incremental loads, dimension table refreshes

Medium Pool (5-10 GB)

  • Driver: 2 vCPU, 2 GB heap
  • Executors: 4-8 instances, each with 1-2 vCPU and 2-4 GB heap
  • Best for: Moderate joins, aggregations with broadcast-eligible dimension tables, hourly partition writes

These are defaults, not hard limits. The elastic executor layer scales within each pool based on actual job demands. If a Small Pool job needs a brief burst of extra executors for a shuffle, they are available in seconds — not minutes.
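Tier selection can be pictured as a simple lookup against estimated input size. The 5 GB and 10 GB cutoffs come from the tier descriptions above; the function itself is an illustrative sketch, not Cazpian's internal logic:

```python
# Hypothetical tier-selection sketch based on the preset tiers above.
# The 5 GB / 10 GB cutoffs come from the text; names are illustrative.

def pool_tier(input_gb: float) -> str:
    if input_gb < 5:
        return "small"      # 2-4 executors, 1 vCPU each
    if input_gb <= 10:
        return "medium"     # 4-8 executors, 1-2 vCPU each
    return "dedicated"      # routed out of the pools entirely

print(pool_tier(0.5))  # → small
print(pool_tier(8))    # → medium
print(pool_tier(200))  # → dedicated
```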

Built-In Optimizations You Get for Free

Cazpian Compute Pools are not just warm clusters sitting idle. They are tuned specifically for the small-to-medium job workloads they serve:

Adaptive Query Execution (AQE)

Enabled by default. AQE dynamically coalesces shuffle partitions, converts sort-merge joins to broadcast joins at runtime, and splits skewed partitions — all without any configuration on your side. For small jobs, this alone can cut execution time by 30-50%.

Smart Shuffle Partitioning

Instead of Spark's default 200 shuffle partitions (wildly excessive for a 2 GB job), Cazpian automatically calculates partition counts based on estimated input size:

partitions = max(2, ceil(estimated_input / 128 MB))

A 1 GB job gets 8 partitions instead of 200. This eliminates the scheduling overhead of hundreds of empty or near-empty tasks and produces cleaner output files.
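The formula above is a one-liner in practice. This sketch implements it directly so you can check the numbers from the text:

```python
# Direct implementation of the partition formula from the text:
#   partitions = max(2, ceil(estimated_input / 128 MB))

import math

MB = 1024 ** 2

def shuffle_partitions(estimated_input_bytes: int) -> int:
    """Partition count sized to ~128 MB per partition, with a floor of 2."""
    return max(2, math.ceil(estimated_input_bytes / (128 * MB)))

print(shuffle_partitions(1024 * MB))  # 1 GB job → 8 partitions, not 200
print(shuffle_partitions(100 * MB))   # tiny job → floor of 2
```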

Generous Broadcast Thresholds

The broadcast join threshold is set high enough (256 MB) that most dimension table joins in small jobs are resolved as broadcast joins — completely avoiding the shuffle. No network transfer, no disk spill, no waiting.
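The AQE and broadcast defaults described in this section map onto standard Spark configuration keys. For comparison, this is roughly what equivalent hand-tuning would look like on a plain Spark cluster — the keys are stock Spark settings, but treating them as a faithful mirror of Cazpian's internals is an assumption:

```python
# Roughly equivalent hand-tuning on plain Spark (assumption: Cazpian's pool
# defaults behave like these stock Spark settings; keys are standard Spark).

small_job_conf = {
    "spark.sql.adaptive.enabled": "true",                     # AQE on
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # merge tiny shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",            # split skewed partitions
    "spark.sql.autoBroadcastJoinThreshold": str(256 * 1024 * 1024),  # 256 MB
}

# e.g. pass these via SparkSession.builder.config(...) when building a session
for key, value in small_job_conf.items():
    print(f"{key}={value}")
```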

Write Coalescing and Compaction

Every job in a Compute Pool writes output through Cazpian's coalescing layer. Instead of producing dozens of small files per partition, output files target optimal sizes: ~512 MB for Iceberg tables. For jobs that run frequently and append small amounts of data, scheduled compaction runs automatically to merge small files — keeping your tables fast for downstream consumers without any manual maintenance.
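The effect of coalescing toward a ~512 MB target is easy to quantify. This illustrative sketch (the function is a simplification, not Cazpian's coalescing layer) shows how a write shrinks to a handful of right-sized files:

```python
# Illustrative sketch of write coalescing toward the ~512 MB target above.
# A simplification for intuition, not Cazpian's actual coalescing layer.

import math

MB = 1024 ** 2

def coalesced_file_count(output_bytes: int, target_bytes: int = 512 * MB) -> int:
    """Number of output files when packing output into ~target-sized files."""
    return max(1, math.ceil(output_bytes / target_bytes))

# A 3 GB write becomes 6 right-sized files instead of dozens of small ones.
print(coalesced_file_count(3 * 1024 * MB))  # → 6
print(coalesced_file_count(50 * MB))        # → 1
```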

Safety and Governance

Shared compute raises fair questions about isolation, security, and fairness. Cazpian addresses all of them:

Workload Isolation

Each job within a Compute Pool runs in its own isolated session. Jobs cannot read each other's intermediate data, access each other's credentials, or interfere with each other's execution. The warm driver and executors are shared infrastructure — the data and logic are strictly isolated.

Fairness Quotas

Compute Pools support workspace-level quotas for concurrency, runtime, and daily bytes processed. If one team submits a burst of 100 jobs, other teams are not starved. Queue management ensures fair scheduling based on configurable priorities.

Auto-Reroute Safety Valves

If a job starts behaving unexpectedly — runtime exceeds 30 minutes, memory spill exceeds a threshold, or partition count explodes — the pool automatically requeues it to dedicated compute. The job still completes; it just runs on infrastructure appropriate for its actual complexity, not the pooled resources.

This fail-safe means you can aggressively route jobs to pools without worrying about edge cases. The system self-corrects.
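The safety-valve check reduces to a few threshold comparisons. The 30-minute runtime limit comes from the text; the spill and partition thresholds, and all names here, are hypothetical placeholders rather than Cazpian's real configuration:

```python
# Illustrative safety-valve check. The 30-minute runtime limit is from the
# text; spill/partition thresholds and all names are hypothetical placeholders.

def should_reroute(runtime_min: float, spill_gb: float, partitions: int,
                   max_runtime_min: float = 30.0,
                   max_spill_gb: float = 8.0,
                   max_partitions: int = 2000) -> bool:
    """True if a pooled job should be requeued to dedicated compute."""
    return (runtime_min > max_runtime_min
            or spill_gb > max_spill_gb
            or partitions > max_partitions)

print(should_reroute(runtime_min=45, spill_gb=0.2, partitions=64))  # → True
print(should_reroute(runtime_min=3, spill_gb=0.0, partitions=8))    # → False
```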

Data Sovereignty: Your VPC, Your Data

Cazpian Compute Pools run entirely within your AWS VPC. Your data never leaves your environment. The Cazpian control plane manages orchestration, metadata, and billing — but all compute and storage remain in your account, under your security controls.

This is not just a compliance checkbox. It means:

  • Your data governance policies apply without exception
  • Network traffic stays within your VPC boundaries
  • You retain full auditability of every compute operation
  • No third-party data residency concerns

Getting Started: The Path to Zero Cold Starts

Migrating to Cazpian Compute Pools does not require rewriting your pipelines. The typical onboarding follows a four-step path:

Step 1: Baseline. Cazpian analyzes your existing job history — input sizes, runtimes, shuffle volumes, output patterns. This produces a workload profile and a projected savings estimate.

Step 2: Route. The intelligent router is configured for your workspaces. Small jobs automatically flow to Compute Pools. Large jobs continue on dedicated compute. No code changes required.

Step 3: Observe. Real-time dashboards show per-job cost, pool utilization, cold-start time eliminated, and output file health. You see the savings from day one.

Step 4: Expand. As confidence builds, expand pool coverage to more teams and workloads. Tune pool sizes and quotas based on actual usage patterns.

Most teams complete the full onboarding in under two weeks and see measurable cost reduction within the first 48 hours.

Measuring the Impact

Cazpian provides built-in observability for every aspect of Compute Pool performance:

  • Time-to-first-task: How quickly your job started doing real work (target: under 2 seconds)
  • Pool utilization: Jobs per hour and idle percentage — are your pools right-sized?
  • Cost per job: Actual compute cost versus what the same job would have cost on a job cluster
  • Output file health: Average file sizes, small file count, compaction debt
  • Cumulative savings: Running total of cold-start cost eliminated versus your baseline

These are not abstract metrics. They translate directly to dollars saved and SLA improvements that you can report to leadership.

What is Next

This is Part 2 of our series on cutting lakehouse compute costs. In Part 3, we will put Cazpian head-to-head with Databricks and EMR in a 2026 Compute Cost Showdown — comparing real costs across three workload scenarios to show where each platform wins and where it wastes your budget.


Want to see what Cazpian Compute Pools would save for your workloads? Talk to our team — we will run a free baseline analysis on your job history and show you the numbers.