Why Your Data Platform Runs Two Engines — And Why That's Costing You

11 min read
Cazpian Engineering
Platform Engineering Team

Take an honest look at your data platform architecture. If you are running a lakehouse on AWS, there is a good chance it looks something like this: Spark clusters for ETL and data engineering, plus Trino (or Dremio, or Presto) clusters for analytics and BI queries. Two engines, two teams, two bills — all pointed at the same data.

This dual-runtime pattern has become the default architecture for most modern data platforms. And on the surface, it makes sense. Spark is great at processing data. Trino is great at querying it. Each engine solves a real problem.

But running two engines has hidden costs that most organizations never quantify — and once you add them up, the number is hard to ignore.

How We Got Here

The dual-engine pattern did not happen by accident. It evolved from a real architectural limitation.

Spark was built for processing, not for interactive queries. When a data analyst opens Tableau and runs a dashboard query, Spark's response model — spin up a JVM, initialize a SparkContext, allocate executors, plan the query, execute it, serialize the results row by row through a Thrift-based JDBC driver — is architecturally mismatched with the sub-second responses that BI tools expect. For a simple SELECT aggregation on a well-partitioned Iceberg table, Spark might take 8-15 seconds where a dedicated query engine takes 2-3.

Trino and Dremio were built for exactly that gap. They are MPP (massively parallel processing) query engines optimized for interactive SQL. They keep clusters warm, plan queries fast, and deliver results through optimized protocols. For dashboards, ad-hoc exploration, and BI tool connectivity, they are genuinely better than raw Spark.

But they cannot do ETL. Trino's write capabilities are limited — INSERT operations are slower than Spark for large-scale data movement, complex multi-step transformations with checkpointing are not supported, and there is no streaming capability. Dremio is fundamentally read-optimized. Neither can replace Spark for data engineering workloads.

So the industry settled on a pragmatic compromise: Spark for writes, a query engine for reads. Two engines, same data, shared catalog. It works. But "it works" is not the same as "it is efficient."

The Five Hidden Costs of Running Two Engines

1. Infrastructure Duplication

Two engines means two sets of clusters, two scaling policies, two monitoring stacks, two alerting configurations, and two upgrade cycles.

Your Spark clusters need to be provisioned for batch windows — sized for the heaviest ETL runs, scaled down overnight. Your Trino or Dremio clusters need to be warm during business hours for interactive queries, possibly with auto-scaling for dashboard peak times.

Neither cluster is fully utilized around the clock. Spark clusters sit idle during the day when analysts are querying. Query engine clusters sit idle at night when batch jobs run. You are paying for two sets of compute to cover two different usage windows.

Estimated overhead: For a mid-size data team, the idle compute across both engine clusters typically runs $800-2,000/month — resources provisioned but underutilized because the usage patterns of the two engines do not overlap.
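To make the "two usage windows" point concrete, here is a back-of-envelope sketch of that idle cost. The hourly rates and active/provisioned hours below are illustrative assumptions, not measurements from any real cluster:

```python
# Back-of-envelope idle-compute estimate for two engine clusters.
# All rates and windows are illustrative assumptions, not measurements.

def idle_cost(hourly_rate, provisioned_hours, busy_hours):
    """Cost of hours a cluster is up but not doing useful work."""
    return hourly_rate * (provisioned_hours - busy_hours)

# Trino: warm 12 h/day for business hours, but only ~6 h/day of real query load.
trino_idle = idle_cost(hourly_rate=3.30, provisioned_hours=12 * 30, busy_hours=6 * 30)

# Spark: provisioned for a 6 h/night batch window, ~4 h of it running jobs.
spark_idle = idle_cost(hourly_rate=4.50, provisioned_hours=6 * 30, busy_hours=4 * 30)

print(f"Trino idle: ${trino_idle:,.0f}/mo")
print(f"Spark idle: ${spark_idle:,.0f}/mo")
print(f"Total idle: ${trino_idle + spark_idle:,.0f}/mo")
```

With these assumed numbers the idle spend lands around $860/month — at the low end of the range above; heavier clusters or wider warm windows push it toward the top.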

2. Cost Opacity

When your data platform spans two engines, attributing costs to specific workloads becomes significantly harder.

A BI dashboard query that takes 3 seconds on Trino was only possible because a Spark job ran 2 hours earlier to transform and materialize the data. What is the true cost of that dashboard? It is the Trino query cost plus an allocated share of the Spark job cost plus the infrastructure overhead of both clusters. No single dashboard shows this end-to-end.
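The shape of that end-to-end calculation can be sketched in a few lines. Every figure below is hypothetical; the point is which terms have to be summed, not the specific values:

```python
# Hedged sketch: rough end-to-end monthly cost of one dashboard that spans
# two engines. All figures are hypothetical placeholders.

def dashboard_monthly_cost(trino_query_cost, runs_per_month,
                           upstream_spark_job_cost, dashboards_sharing_job,
                           shared_infra_overhead, dashboards_on_platform):
    query_cost = trino_query_cost * runs_per_month          # the visible part
    spark_share = upstream_spark_job_cost / dashboards_sharing_job
    infra_share = shared_infra_overhead / dashboards_on_platform
    return query_cost + spark_share + infra_share

cost = dashboard_monthly_cost(
    trino_query_cost=0.004,         # ~3 s of Trino compute per refresh
    runs_per_month=3000,            # refreshes across all viewers
    upstream_spark_job_cost=120.0,  # the 2 h nightly materialization job
    dashboards_sharing_job=4,       # other dashboards read the same table
    shared_infra_overhead=1200.0,   # idle/overhead pool from both clusters
    dashboards_on_platform=40,
)
print(f"True monthly cost of this dashboard: ${cost:,.2f}")
```

Under these assumptions the Trino query cost ($12) is only a sixth of the true cost ($72); the rest is upstream Spark work and shared overhead that no single bill attributes to the dashboard.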

FinOps teams report that cost attribution accuracy drops by 30-40% when workloads span multiple compute engines. You know your total cloud bill. You struggle to know what each pipeline or dashboard actually costs.

3. Metadata Synchronization Pain

When Spark writes an Iceberg table and Trino reads it, both engines interact with the same catalog. In theory, this is seamless. In practice, it is a recurring source of production incidents.

Stale metadata reads. Trino may cache query plans or table metadata for performance. After a Spark write commits new data files, Trino might still read the previous snapshot until its cache expires or is invalidated. The result: dashboards showing stale data, analysts making decisions on outdated numbers, and support tickets that are maddening to debug because "the data is right in one tool but wrong in the other."
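The failure mode is easy to simulate. The classes below are a pure-Python stand-in — not Trino internals — using snapshot IDs the way Iceberg assigns one per commit, with a TTL-based cache like the one a query engine keeps:

```python
import time

# Minimal simulation of a staleness window between a catalog and an
# engine-side metadata cache. Catalog/cache classes are hypothetical
# stand-ins, not real Trino or Iceberg APIs.

class Catalog:
    def __init__(self):
        self.current_snapshot = 100

    def commit(self):  # e.g. a Spark write finishing
        self.current_snapshot += 1

class CachedTableMetadata:
    def __init__(self, catalog, ttl_seconds=300):
        self.catalog = catalog
        self.ttl = ttl_seconds
        self.snapshot = catalog.current_snapshot
        self.fetched_at = time.monotonic()

    def snapshot_for_query(self):
        # Within the TTL the engine plans against the cached snapshot,
        # even though the catalog has already moved on.
        if time.monotonic() - self.fetched_at < self.ttl:
            return self.snapshot
        self.snapshot = self.catalog.current_snapshot
        self.fetched_at = time.monotonic()
        return self.snapshot

catalog = Catalog()
cache = CachedTableMetadata(catalog, ttl_seconds=300)

catalog.commit()  # Spark writes new data
stale = cache.snapshot_for_query() != catalog.current_snapshot
print(f"engine sees snapshot {cache.snapshot_for_query()}, "
      f"catalog is at {catalog.current_snapshot}, stale={stale}")
```

Until the TTL expires or the cache is invalidated, every query in that window reads the old snapshot — which is exactly the "right in one tool, wrong in the other" symptom.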

Schema evolution conflicts. A data engineer adds a column in Spark. The Iceberg table handles schema evolution gracefully — but Trino's query planner may not immediately pick up the change, or worse, an existing dashboard query may fail because it is hardcoded to expect a specific column order.

Statistics mismatch. Spark computes table statistics for its own optimizer. Trino computes its own statistics independently. Neither trusts the other's numbers. This means suboptimal query plans on at least one side, leading to slower queries or excessive resource usage.

These are not hypothetical problems. They are among the most commonly reported multi-engine bugs in data engineering community discussions, and most teams running Spark alongside Trino or Dremio have hit at least one of them.

4. Skills Fragmentation

Two engines means two skillsets on your team.

Your data engineers write PySpark, manage Spark configurations, and debug Spark UI execution plans. Your analysts and analytics engineers write SQL and expect Trino-style query behavior. Your platform team needs to understand the operational characteristics of both engines — different failure modes, different tuning knobs, different scaling behavior.

When an incident spans both engines — a Spark job produces unexpected output that causes a Trino query to fail downstream — debugging requires someone who understands both systems well enough to trace the problem across the boundary. That person is rare and expensive.

The hiring cost is real too. You are not just hiring "data engineers." You are hiring people with dual-engine experience, or you are hiring separate specialists for each engine. Either way, the talent pool is more constrained and the team is more fragmented than it needs to be.

5. Governance and Security Fragmentation

Access control configured in one engine does not automatically apply in the other.

If you set up column-level security in Spark (via your catalog's access control layer), you need to verify that Trino enforces the same policy. If you configure row-level filtering for a multi-tenant dataset, both engines need to implement it consistently. Audit trails come from two different systems — stitching them together for a compliance report requires custom tooling or manual effort.

Most teams solve this by centralizing governance in the catalog layer (e.g., Apache Iceberg REST Catalog with policy enforcement). This helps, but it adds another component to manage and another potential point of failure. And even with a centralized catalog, the enforcement happens independently in each engine — so testing and verifying consistent behavior across both is an ongoing operational task.

Quantifying the Dual-Engine Tax

Let us put approximate numbers on a typical mid-size data team running both Spark and Trino on AWS.

| Cost Category | Monthly Estimate |
| --- | --- |
| Trino cluster (warm during business hours, r5.2xlarge x 5) | $2,400 |
| Spark clusters (ETL batch windows + small jobs) | $3,200 |
| Idle compute (underutilization across both) | $1,200 |
| Platform engineering time (dual-engine ops, ~15 hrs/mo at $100/hr) | $1,500 |
| Incident debugging across engine boundaries (~5 hrs/mo) | $500 |
| Governance and access control sync (~4 hrs/mo) | $400 |
| **Total dual-engine cost** | **$9,200/mo** |
| Single-engine equivalent (Spark-only, well-managed) | $5,800/mo |
| **Dual-engine tax** | **$3,400/mo ($40,800/year)** |

The exact numbers vary by organization, but the pattern is consistent: 30-40% of total data platform compute spend is attributable to the overhead of running two engines rather than one.
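The table's totals are easy to sanity-check with a few lines of arithmetic over the same line items:

```python
# Line items from the cost table above; verifying the totals add up.
line_items = {
    "Trino cluster": 2400,
    "Spark clusters": 3200,
    "Idle compute": 1200,
    "Platform engineering time": 1500,
    "Incident debugging": 500,
    "Governance sync": 400,
}

dual_engine_total = sum(line_items.values())
single_engine_equivalent = 5800
tax_monthly = dual_engine_total - single_engine_equivalent

print(f"Dual-engine total: ${dual_engine_total:,}/mo")    # $9,200/mo
print(f"Dual-engine tax:   ${tax_monthly:,}/mo "
      f"(${tax_monthly * 12:,}/yr)")                      # $3,400/mo ($40,800/yr)
print(f"Share of spend:    {tax_monthly / dual_engine_total:.0%}")  # 37%
```

For this example the tax works out to 37% of total platform spend, consistent with the 30-40% pattern.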

Why Neither Engine Can Replace the Other (Today)

If the dual-engine pattern is expensive, why not just pick one?

Because each engine has real limitations that the other compensates for:

Trino and Dremio Cannot Replace Spark

  • Write performance is limited. Trino's INSERT operations are significantly slower than Spark for large-scale data movement. Dremio is fundamentally read-optimized.
  • No complex ETL support. Multi-step transformations with intermediate checkpointing, iterative algorithms, and procedural logic are not expressible in these query engines.
  • No streaming. Structured Streaming pipelines that process continuous data feeds have no equivalent in Trino or Dremio.
  • No ML/AI pipeline support. Training models, running feature engineering, and executing inference pipelines require Spark's programmatic API and ML libraries.
  • Fault tolerance for writes is weaker. If a large write operation fails mid-execution, Spark's task-level retry and checkpoint mechanisms are far more robust.

Spark Cannot Fully Replace Query Engines for Interactive Analytics

  • Cold starts kill interactivity. Even with warm pools, Spark's query initialization overhead is higher than a dedicated query engine's. For a dashboard that needs sub-second response, this matters.
  • JVM overhead for simple queries. A simple aggregation on a well-indexed table does not need Spark's full distributed processing framework. A lightweight MPP engine handles it faster.
  • JDBC/ODBC result delivery is slow. Spark's Thrift-based JDBC driver serializes results row by row. For queries returning large result sets to BI tools, this becomes the bottleneck.
  • Resource-heavy for ad-hoc usage. Keeping Spark warm for occasional ad-hoc queries is expensive relative to the value delivered.
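The row-by-row bottleneck in the third bullet can be illustrated by counting serialization steps. The numbers below are synthetic — real JDBC and Arrow implementations differ in constant factors — but the asymmetry between per-cell and per-buffer work is the point:

```python
# Illustrative sketch: serialization steps for row-at-a-time delivery
# vs. columnar batches. Synthetic numbers; real protocols differ in
# constant factors, but the asymmetry is the point.

rows, columns = 1_000_000, 20
batch_size = 65_536  # rows per columnar batch (an assumed batch size)

# Row-wise: one serialization step per cell, driven row by row.
rowwise_ops = rows * columns

# Columnar: one step per column per batch, each moving a contiguous buffer.
batches = -(-rows // batch_size)  # ceiling division
columnar_ops = batches * columns

print(f"row-wise serialization steps: {rowwise_ops:,}")
print(f"columnar serialization steps: {columnar_ops:,}")
print(f"ratio: {rowwise_ops // columnar_ops:,}x fewer steps")
```

Twenty million tiny per-cell operations versus a few hundred bulk buffer copies — that gap is why result delivery, not query execution, is often the bottleneck for large result sets.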

This is the architectural tension that has kept two engines alive in most data platforms. Each engine is genuinely good at what it does — and genuinely poor at what the other does.

Three Paths Forward

Path 1: Accept the Dual-Engine Reality and Optimize

Keep both engines but reduce the overhead. Centralize governance in the catalog. Implement shared monitoring. Use Spot/preemptible instances for both clusters. Automate scaling policies. Accept the operational cost as the price of having the right tool for each workload.

Trade-off: You reduce the overhead but do not eliminate it. The dual-engine tax shrinks but remains.

Path 2: Wait for Engines to Converge

Spark is getting better at interactive queries (AQE, better Iceberg integration, optimized runtimes). Trino is slowly adding write capabilities (MERGE INTO for Iceberg). Over time, the gap narrows.

Trade-off: This is happening, but slowly. Spark may never match a dedicated query engine for sub-second latency, and Trino may never match Spark for complex ETL. You could be waiting for convergence that never fully arrives.

Path 3: Use a Platform That Abstracts the Engine Choice

Instead of asking "which engine should I run?" — use a platform that handles the routing transparently. A well-managed lakehouse platform can serve ETL workloads and interactive queries from a unified compute layer, with intelligent routing, right-sized resources, and a high-performance data access protocol that eliminates the traditional bottlenecks.

Trade-off: You give up direct engine control in exchange for operational simplicity and unified cost management.

What a Unified Architecture Looks Like

The path to eliminating the dual-engine tax is not about making Spark faster or making Trino write better. It is about building a platform layer that:

  1. Uses a single compute engine for all workloads — ETL, batch, streaming, and interactive queries
  2. Exposes multiple access paths — an orchestration path for write-heavy workloads and a high-performance SQL access path for read-heavy analytics
  3. Provides a single governance layer — one set of policies, one audit trail, one catalog
  4. Bills transparently — one cost model, not two overlapping cloud bills
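Point 2 — one engine, two access paths — can be sketched as a simple statement router. The classifier and path names below are illustrative only, not Cazpian's actual API:

```python
# Hypothetical sketch of "one engine, two access paths": classify a SQL
# statement and route it to a write/orchestration path or a low-latency
# read path. Path names and the keyword heuristic are illustrative.

WRITE_KEYWORDS = ("insert", "update", "delete", "merge", "create", "copy")

def route(sql: str) -> str:
    first_word = sql.lstrip().split(None, 1)[0].lower()
    return "orchestrator" if first_word in WRITE_KEYWORDS else "flight_sql_gateway"

print(route("SELECT count(*) FROM sales"))            # read path
print(route("MERGE INTO sales USING staged ON ..."))  # write path
```

A production router would inspect the parsed plan rather than the first keyword, but the architectural idea is the same: one engine underneath, with each request steered to the access path whose latency and throughput profile fits it.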

This is the architecture Cazpian is built on. A single Cazpian Compute Engine handles ETL, streaming, and interactive queries. Write workloads flow through the Cazpian Orchestrator. Read workloads flow through a high-performance Arrow Flight SQL gateway that delivers results to BI tools at columnar speed — 10-50x faster than traditional JDBC/ODBC.

One engine. Two access paths. Zero duplication.

In our next post, we will dive into how Arrow Flight SQL changes the equation — why the traditional data transfer bottleneck forced architects to add query engines in the first place, and how removing that bottleneck makes a single-engine lakehouse not just possible but practical.


Running Spark and Trino (or Dremio) side by side? Talk to our team — we can help you quantify your dual-engine tax and map a path to consolidation.