One Engine, Two Access Paths: How Arrow Flight SQL Makes a Single-Engine Lakehouse Possible

· 14 min read
Cazpian Engineering
Platform Engineering Team

In our previous post, we broke down the five hidden costs of running two compute engines in your lakehouse — the infrastructure duplication, the cost opacity, the metadata sync bugs, the skills fragmentation, and the governance headaches. We showed that this dual-engine tax can run $40,000+ per year for a mid-size data team.

The obvious question: why not just use Spark for everything?

The honest answer has always been: because Spark cannot deliver query results to BI tools fast enough. Not because Spark cannot execute the query — it usually can — but because the last mile of data delivery through traditional JDBC/ODBC protocols is painfully slow.

Arrow Flight SQL eliminates that bottleneck. And with it, the primary architectural reason for running a second query engine disappears.

The Real Bottleneck: It is Not the Query Engine

When a data analyst runs a dashboard query through Tableau or Power BI against a lakehouse, the total response time breaks down into three phases:

|-- Query Planning --|-- Query Execution --|-- Result Transfer --|
      100-500ms            500ms-10s              1s-60s+

Most optimization discussions focus on the first two phases — faster query planning, better partition pruning, materialized views, Iceberg metadata optimization. These matter. But for many BI workloads, the result transfer phase is the actual bottleneck.

Why Traditional JDBC/ODBC Is Slow

When Spark (or any engine) delivers results through a Thrift-based JDBC or ODBC driver, the process works like this:

  1. The engine executes the query and produces an in-memory result set (often in a columnar format internally).
  2. The JDBC driver serializes each row individually into a text or binary wire format.
  3. Each row is sent over the network, one at a time or in small batches.
  4. The BI tool deserializes each row back into its own internal format.
  5. If the result set has complex types (decimals, timestamps, nested structures), type conversion adds additional overhead at both ends.

For a query returning 500,000 rows with 20 columns — a common dashboard aggregation against a fact table — this row-by-row serialization/deserialization cycle can take 10-60 seconds even when the query itself executed in under 2 seconds.
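The cost of steps 2-4 is easy to see even in plain Python. The sketch below is an illustration only, not JDBC itself: it serializes the same synthetic result set twice — once row by row, the way a Thrift-based driver frames results, and once column by column, the way an Arrow RecordBatch lays data out (the dataset is scaled down to 50,000 rows to keep it quick):

```python
import json
import struct
import time

# Synthetic result set: 50,000 rows x 4 typed columns.
rows = [(i, i * 2.5, f"name-{i}", i % 7) for i in range(50_000)]

# Row-by-row wire format: each row is serialized and framed individually,
# which is roughly what a Thrift-based JDBC driver does.
start = time.perf_counter()
row_wire = [json.dumps(r).encode() for r in rows]
row_time = time.perf_counter() - start

# Columnar wire format: each column is serialized once as one contiguous
# block, which is roughly how an Arrow RecordBatch is laid out.
start = time.perf_counter()
cols = list(zip(*rows))
col_wire = [
    struct.pack(f"{len(cols[0])}q", *cols[0]),  # int64 column
    struct.pack(f"{len(cols[1])}d", *cols[1]),  # float64 column
    json.dumps(cols[2]).encode(),               # string column
    struct.pack(f"{len(cols[3])}q", *cols[3]),  # int64 column
]
col_time = time.perf_counter() - start

print(f"row-by-row: {row_time:.3f}s, columnar: {col_time:.3f}s")
```

On a typical machine the columnar pass is several times faster, and the gap grows with row count — the per-row framing cost is exactly the overhead Arrow Flight SQL removes.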

This is not a Spark-specific problem. Any engine using traditional JDBC/ODBC pays this tax. But for Spark, which already has JVM initialization overhead, the combined latency makes it uncompetitive for interactive BI use cases. This is why architects added Trino or Dremio — not because they query fundamentally faster, but because they were designed with tighter result delivery pipelines.

What Changes with Arrow Flight SQL

Arrow Flight SQL replaces the entire result delivery pipeline with a fundamentally different architecture:

Columnar transfer, not row-based. Results are sent as Apache Arrow RecordBatches — columnar data structures that BI tools and analytical engines already use internally. No row-by-row serialization. The data format on the wire is the same format the BI tool needs in memory.

Zero-copy deserialization. The receiving client can read Arrow data directly from the network buffer without copying it into a separate data structure. The bytes on the wire become the bytes in memory. This eliminates an entire class of CPU-bound overhead.
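The zero-copy idea can be illustrated with nothing but the Python standard library (this is an analogy for the principle, not Arrow's actual buffer layout):

```python
import struct

# Five int64 values as raw wire bytes -- what arrives off the network.
wire = struct.pack("5q", 10, 20, 30, 40, 50)

# A row-based client copies: it parses each value into a new Python object.
copied = [struct.unpack_from("q", wire, i * 8)[0] for i in range(5)]

# Zero-copy: a memoryview reinterprets the same buffer as typed values
# without allocating or copying anything -- conceptually what an Arrow
# client does with RecordBatch buffers at much larger scale.
view = memoryview(wire).cast("q")

assert list(view) == copied == [10, 20, 30, 40, 50]
assert view.obj is wire  # same underlying bytes; nothing was copied
```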

gRPC/HTTP2 transport. Arrow Flight SQL uses gRPC, which provides multiplexed streaming, header compression, and binary framing. Multiple result partitions can stream concurrently over a single connection. Compared to the request-response model of traditional JDBC, this is far more efficient for large result sets.

Parallel data streams. Flight SQL supports fetching result partitions from multiple endpoints simultaneously. A query that produces results across multiple partitions can stream all partitions concurrently, cutting transfer time proportionally.

Native type preservation. Decimals stay decimals. Timestamps stay timestamps. No string conversion round-trips, no precision loss, no type coercion overhead.
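To a client application, all of this is invisible behind a standard database API. A minimal sketch using the open-source Apache Arrow ADBC driver for Flight SQL (package `adbc-driver-flightsql`; the URI and SQL here are placeholders, and error handling is omitted):

```python
def fetch_dashboard_table(uri: str, sql: str):
    """Run a query over Arrow Flight SQL and return an Arrow table.

    Assumes the adbc-driver-flightsql package is installed and a Flight SQL
    endpoint is reachable at `uri` (e.g. "grpc://host:32010").
    """
    import adbc_driver_flightsql.dbapi as flight_sql  # pip install adbc-driver-flightsql

    with flight_sql.connect(uri) as conn, conn.cursor() as cur:
        cur.execute(sql)
        # Results stream back as Arrow record batches -- columnar end to
        # end, with no row-by-row decode on the client.
        return cur.fetch_arrow_table()
```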

The Performance Difference: Not Incremental, Transformational

The improvement from switching from JDBC/ODBC to Arrow Flight SQL is not a 10-20% optimization. Depending on the workload, it is a 10-80x improvement in result delivery speed.

| Scenario | Traditional JDBC | Arrow Flight SQL | Improvement |
|---|---|---|---|
| 100K rows, 20 columns, simple types | 3.2s | 0.15s | 21x |
| 500K rows, 30 columns, mixed types | 18s | 0.8s | 22x |
| 1M rows, 20 columns, decimals + timestamps | 45s | 1.1s | 41x |
| 5M rows, 10 columns, large aggregation export | 120s+ | 2.8s | 43x |

These gains are most pronounced in exactly the scenarios that BI tools create: moderate-to-large result sets with typed columns, consumed by analytical clients that need columnar data. The wider the result set and the more complex the types, the bigger the Arrow advantage.

For small result sets (a few hundred rows), the difference is negligible — traditional JDBC is fast enough. But those are not the queries that drive architects to add a second query engine.

Rethinking the Architecture

If result delivery was the real reason you needed a second engine, and Arrow Flight SQL eliminates that bottleneck, the architecture simplifies dramatically.

The Dual-Engine Architecture (Before)

BI Tools (Tableau, Power BI, Looker)
            |
            | JDBC/ODBC (slow, row-based)
            v
[ Trino / Dremio Cluster ]  ← Read-only query engine
            |
            | Reads Iceberg tables
            v
[ Iceberg Catalog + S3 ]
            ^
            | Writes Iceberg tables
            |
[ Spark Cluster ]  ← ETL / batch / streaming engine
            ^
            |
Cazpian Orchestrator / Scheduler

Two engines. Two clusters. Two cost lines. Two governance configurations. Metadata sync issues between them.

The Unified Architecture (After)

BI Tools (Tableau, Power BI, Looker)         Cazpian Orchestrator
              |                                        |
              | Arrow Flight SQL (fast, columnar)      | ETL, Streaming, Batch
              v                                        v
[ Arrow Flight SQL Gateway ]             [ Cazpian Compute Engine ]
              |                                        |
              |________________________________________|
                                  |
                                  v
                     [ Iceberg Catalog + S3 ]

One engine. Two access paths. BI tools connect through Arrow Flight SQL for high-speed reads. ETL, streaming, and batch jobs flow through the Cazpian Orchestrator for writes. Both paths hit the same compute layer, the same catalog, the same governance policies.

The query engine cluster is gone. The metadata sync problem is gone. The governance duplication is gone. The cost opacity is gone.

Addressing the Objection: "But Spark Is Too Slow for Interactive Queries"

This is the objection every data architect raises, and it deserves an honest answer.

For sub-second point queries: yes, Spark is slower than a dedicated MPP engine. If your primary use case is a real-time operational dashboard that needs 200ms response times on single-row lookups, a dedicated query engine like Trino will outperform Spark.

But that is not what most BI workloads look like. The vast majority of enterprise BI queries are:

  • Dashboard aggregations over partitioned fact tables (2-10 second acceptable response)
  • Scheduled report generation (minutes are fine)
  • Ad-hoc exploration by analysts (5-15 second response is acceptable)
  • Data exports to Excel or downstream tools (transfer speed matters more than execution speed)

For these workloads, a well-optimized Spark engine with modern Iceberg integration is more than fast enough. Here is why:

Adaptive Query Execution (AQE) dynamically optimizes query plans at runtime — coalescing shuffle partitions, converting joins to broadcasts, handling data skew. This closes much of the gap with dedicated query engines for analytical queries.
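The AQE behaviors described above are controlled by standard Spark 3.x configuration keys. Shown here as a plain Python dict for illustration; how they are applied depends on your deployment (they are on by default in recent Spark releases):

```python
# Standard Spark 3.x AQE settings (shown as a dict for illustration).
aqe_conf = {
    # Re-optimize query plans at runtime using actual shuffle statistics.
    "spark.sql.adaptive.enabled": "true",
    # Merge small shuffle partitions at runtime instead of guessing up front.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Split skewed partitions so one hot key cannot stall the whole stage.
    "spark.sql.adaptive.skewJoin.enabled": "true",
}
```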

Iceberg metadata pruning means Spark only reads the data files it needs. Partition pruning, file-level min/max statistics, and manifest filtering ensure that a query over a 10 TB table only scans the relevant 500 MB. The query engine's advantage of "faster scanning" becomes irrelevant when both engines scan the same small subset.

Cazpian Native Execution Engine replaces JVM-based processing with Arrow-native vectorized execution for Parquet and Iceberg operations. Instead of interpreting query plans through Spark's traditional JVM code paths, Cazpian offloads scan, filter, aggregation, and shuffle operations to a native columnar engine built on Apache Arrow. The result is Photon-class performance — 2-4x faster for scan-heavy analytical queries — without proprietary lock-in. Because the native engine operates directly on Arrow columnar data, there is zero serialization overhead between execution and result delivery through Arrow Flight SQL.

Warm Compute Pools (as described in our earlier post) eliminate Spark's cold-start overhead. When the driver is already running and executors are already allocated, Spark's query initialization time drops from minutes to milliseconds.

Arrow Flight SQL eliminates the delivery bottleneck. The gap that made Spark feel 5-10x slower than Trino for BI queries was largely in the JDBC result delivery. Remove that bottleneck, and the total query-to-dashboard time converges.

The net result: for 90%+ of typical BI workloads, a Cazpian Compute Engine with native execution and Arrow Flight SQL delivers an experience that is within 1-3 seconds of a dedicated query engine — comfortably within the tolerance of dashboard users. And for that remaining edge case of sub-second operational dashboards, caching and materialized views can bridge the gap.

What BI Tools See

A critical practical question: do BI tools need to change to work with Arrow Flight SQL?

No. Arrow Flight SQL provides standard JDBC and ODBC drivers. From the perspective of Tableau, Power BI, Looker, or any other BI tool, the connection is a normal JDBC endpoint. The tool does not know (or need to know) that Arrow Flight SQL is running underneath. It submits SQL, it gets results — just dramatically faster.

This is important because it means migrating from a dual-engine architecture to a unified one does not require changing your BI layer. Your dashboards, your connection strings, your SQL — they all stay the same. The only thing that changes is what is behind the JDBC endpoint.
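Concretely, the migration can be as small as swapping a JDBC URL. The sketch below contrasts a Spark Thrift Server URL with the URL shape used by the open-source Arrow Flight SQL JDBC driver (hostnames and ports are placeholders):

```python
# Before: BI tool connects to a Spark Thrift Server (HiveServer2 protocol).
old_url = "jdbc:hive2://spark-thrift.example.com:10000/default"

# After: same BI tool, same SQL -- only the URL changes to point at an
# Arrow Flight SQL endpoint via its JDBC driver.
new_url = "jdbc:arrow-flight-sql://flight.example.com:32010?useEncryption=true"
```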

How Cazpian Implements This

Cazpian's unified compute architecture is built on the principle of one engine, two access paths:

The Write Path: Cazpian Orchestrator

ETL jobs, streaming pipelines, Iceberg maintenance tasks, and AI workloads are submitted by the Cazpian Orchestrator directly to the Cazpian Compute Engine. These are write-heavy, stateful workloads that need Spark's full distributed processing capabilities — checkpointing, fault tolerance, complex transformations, ML pipeline stages.

The Orchestrator handles scheduling, dependency management, and resource allocation. Jobs are routed to Compute Pools for small/medium workloads or to dedicated compute for large batch jobs.

The Read Path: Arrow Flight SQL Gateway

BI tools and analysts connect through Cazpian's Arrow Flight SQL gateway. The gateway accepts standard JDBC/ODBC connections, translates them to Arrow Flight SQL, delegates the query to the same Cazpian Compute Engine, and streams back Arrow-format results.

The result: a Tableau dashboard query gets the full power of a distributed Spark engine for execution, plus the full speed of Arrow columnar transfer for delivery. No separate query engine cluster. No metadata sync. No governance duplication.

What Makes This Work as a Platform

The key is not just "put Arrow Flight SQL in front of Spark." That is a technical building block, not a platform. What makes Cazpian's architecture practical is the surrounding infrastructure:

Unified governance. The same Cazpian IAM and Catalog policies apply to both the write path and the read path. A column-level security rule configured once is enforced everywhere — whether the access comes from an ETL job or a BI dashboard query.

Unified billing. One compute model, one bill. No trying to reconcile Spark cluster costs with Trino cluster costs. Every query, whether it comes through the Orchestrator or the Flight SQL gateway, is metered in the same transparent units.

Unified observability. A single lineage graph shows how data flows from ingestion through transformation to the dashboard. No stitching together logs from two different engines to trace a data quality issue.

Workload-aware resource management. Interactive queries through the Flight SQL gateway get prioritized for low-latency response. Batch ETL jobs through the Orchestrator get prioritized for throughput. Both share the same compute pool, with intelligent scheduling to prevent batch jobs from starving interactive queries or vice versa.

The Competitive Landscape

How does this compare to what other platforms offer?

Databricks

Databricks separates workloads into Jobs Compute (for ETL) and SQL Warehouses (for interactive queries, powered by Photon). These are effectively two different runtimes with separate billing. Photon is a proprietary C++ vectorized engine — fast, but closed-source and only available within Databricks. SQL Warehouses use a proprietary Thrift-based protocol for BI connectivity — not Arrow Flight SQL. The result: you get a managed dual-runtime experience with strong query performance, but it is still two engines with two cost profiles and full vendor lock-in on the acceleration layer.

Dremio

Dremio pioneered Arrow Flight SQL for BI connectivity. Their query engine is fast and Arrow-native. But Dremio is a read-only query engine. You still need Spark (or Glue, or another engine) for ETL, streaming, and data engineering workloads. Dremio plus Spark is the dual-engine pattern with a faster read layer — better, but not unified.

Trino / Starburst

Trino provides interactive SQL, and recent releases have added Iceberg write support (MERGE INTO). But complex ETL, streaming, and ML workloads remain outside Trino's scope. Running Starburst alongside Spark is a managed dual-engine experience.

Cazpian

One engine for ETL, streaming, interactive queries, and AI. Arrow Flight SQL for high-performance BI access. Single governance, single billing, single operational surface. All compute runs in your VPC.

| Capability | Databricks | Dremio | Trino | Cazpian |
|---|---|---|---|---|
| ETL / Batch | Jobs Compute | Needs Spark | Needs Spark | Cazpian Compute Engine |
| Streaming | Structured Streaming | Not supported | Not supported | Cazpian Compute Engine |
| Interactive SQL | SQL Warehouse (Photon) | Dremio Engine | Trino Engine | Cazpian Compute Engine |
| Native Execution | Photon (proprietary C++) | Arrow-native | JVM-based | Native Engine (Arrow-based, open) |
| BI Access Protocol | Thrift/JDBC | Arrow Flight SQL | JDBC | Arrow Flight SQL |
| ML / AI | MLflow + Spark | Not supported | Not supported | Spark ML + AI Studio |
| Runtimes to manage | 2 | 2 (Dremio + Spark) | 2 (Trino + Spark) | 1 |
| Governance model | Unified (Unity Catalog) | Partial | External | Unified (Cazpian Catalog) |

What You Eliminate

Switching from a dual-engine architecture to Cazpian's unified model removes entire categories of operational overhead:

  • No second cluster to manage. No Trino/Dremio infrastructure to provision, scale, monitor, upgrade, or troubleshoot.
  • No metadata synchronization. One engine reads and writes to the same catalog. No stale cache, no schema drift between engines, no statistics mismatch.
  • No governance duplication. One set of access control policies. One audit trail. One compliance surface.
  • No cost attribution complexity. Every query and every job runs through one compute layer with one billing model.
  • No data freshness lag. When a Spark ETL job writes new data, the next BI query through Arrow Flight SQL sees it immediately — same engine, same catalog, same snapshot.
  • Faster result delivery. Arrow Flight SQL's columnar transfer delivers results 10-50x faster than Thrift-based JDBC, making the unified architecture feel as fast as (or faster than) a dedicated query engine for most BI workloads.

Getting Started

If you are currently running Spark alongside Trino or Dremio, consolidating to a unified platform does not require a big-bang migration:

  1. Audit your query engine usage. What queries run on Trino/Dremio? What percentage are simple aggregations that Spark with AQE could handle? What percentage require sub-second latency?
  2. Pilot with one workload. Move a single dashboard or reporting pipeline to Cazpian's Arrow Flight SQL endpoint. Compare the experience — latency, throughput, governance.
  3. Measure the operational savings. Track the reduction in infrastructure costs, engineering hours, and incident volume as you decommission the second engine.
  4. Expand progressively. Migrate additional BI workloads to the unified platform. Keep the dedicated query engine only for the edge cases that genuinely require sub-second interactive latency.

Most teams find that 80-90% of their query engine workloads can move to a unified platform without any degradation in user experience.


Running Spark alongside a query engine and wondering if you can simplify? Contact the Cazpian team — we will help you assess which workloads can consolidate and project your savings.