
10 posts tagged with "Cost Optimization"


Iceberg Metrics Reporting: How to Monitor Scan and Commit Health with Spark

20 min read
Cazpian Engineering
Platform Engineering Team

You designed the partitions correctly. You set up compaction. You even configured bloom filters. But your Iceberg tables are still slow — and you have no idea why. Is it the scan planning? Too many manifests? Delete files accumulating silently? Commit retries from writer contention? You cannot fix what you cannot see.

Apache Iceberg actually gives you everything you need to diagnose table health. The problem is that the metrics are scattered across six different layers — a Java API, virtual SQL tables, snapshot properties, file-level statistics, Puffin blobs, and engine-level instrumentation — and no one has assembled them into a single picture. This post does exactly that.

We will walk through every layer of Iceberg metrics, show you how to collect them, explain what each metric means for your read and write performance, and give you concrete thresholds and SQL queries that tell you when something is wrong and what to do about it.
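
To make this concrete before diving in: recent Iceberg releases ship a built-in LoggingMetricsReporter that can be wired into any catalog. A minimal sketch, assuming Iceberg 1.1+, a Hadoop-type catalog named demo, and a hypothetical demo.db.events table:

```python
from pyspark.sql import SparkSession

# Spark session with an Iceberg catalog; the iceberg-spark-runtime jar
# is assumed to be on the classpath. Names and paths are placeholders.
spark = (
    SparkSession.builder
    .appName("iceberg-metrics-demo")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse")
    # Route each ScanReport and CommitReport to the driver log.
    .config("spark.sql.catalog.demo.metrics-reporter-impl",
            "org.apache.iceberg.metrics.LoggingMetricsReporter")
    .getOrCreate()
)

# Any scan against the catalog now logs planning metrics: planning
# duration, matching vs. skipped data files, delete files read, and more.
spark.sql("SELECT count(*) FROM demo.db.events").show()
```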

Iceberg Bloom Filters with Spark: Configuration, Validation, and Performance Guide

21 min read
Cazpian Engineering
Platform Engineering Team

When you query an Iceberg table with WHERE user_id = 'abc-123', Spark reads every Parquet file that could contain that value. It first checks partition pruning — does this file belong to the right partition? Then it checks column statistics — does the min/max range for user_id in this file include 'abc-123'? But for high-cardinality columns like UUIDs, user IDs, session IDs, or trace IDs, min/max statistics are nearly useless. The min might be 'aaa...' and the max might be 'zzz...', so every file passes the min/max check even though only one file actually contains the value.

This is where bloom filters come in. A bloom filter is a compact probabilistic data structure embedded in each Parquet file that can definitively say "this value is NOT in this file" — allowing Spark to skip the file entirely. For point lookups on high-cardinality columns, bloom filters can reduce I/O by 80-90%.

This post covers everything you need to know: how bloom filters work internally, when to use them, how to configure them on Iceberg tables, how to validate they are present in your Parquet files, and what false positives mean for your data correctness.
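
As a taste of the configuration section: enabling a bloom filter is a per-column table property. A minimal sketch, assuming a hypothetical demo.db.events table with a high-cardinality user_id column and an existing Iceberg-enabled SparkSession:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog "demo" assumed configured

# Only files written after this change carry the filter.
spark.sql("""
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'write.parquet.bloom-filter-enabled.column.user_id' = 'true'
    )
""")

# Rewrite existing files so older data picks up the filter as well.
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table => 'db.events',
        options => map('rewrite-all', 'true')
    )
""").show()
```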

Storage Partitioned Joins in Apache Iceberg with Spark

13 min read
Cazpian Engineering
Platform Engineering Team

Every Spark join starts the same way: read both sides, shuffle the data across the network so matching keys end up on the same executor, then join. That shuffle is the single most expensive operation in most Spark jobs — it moves data across the network, writes temporary files to disk, and consumes memory on every executor in the cluster.

But what if both tables are already organized by the join key on disk? If both tables hash customer_id into the same buckets, every customer_id=42 row lands in the same bucket on each side, and there is nothing to shuffle. Each executor can join its matching buckets independently.

That is exactly what Storage Partitioned Join (SPJ) does. Introduced in Spark 3.3 and matured in Spark 3.4+, SPJ is the most impactful — and least understood — optimization available for Iceberg+Spark workloads. This post shows you how it works, how to set it up, how to verify it, and where it breaks.
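
As a preview of the setup section, these are the Spark 3.4+ settings typically involved. A sketch, assuming two Iceberg tables bucketed identically on customer_id (table names are examples):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog "demo" assumed configured

# Let Spark's planner use the tables' bucket layout as a join distribution.
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")
spark.conf.set("spark.sql.iceberg.planning.preserve-data-grouping", "true")
# Relax the Spark 3.4 co-partitioning requirements so partially matching
# layouts still qualify.
spark.conf.set("spark.sql.requireAllClusterKeysForCoPartition", "false")
spark.conf.set("spark.sql.sources.v2.bucketing.pushPartValues.enabled", "true")
# Disable broadcast so the plan reveals whether the shuffle was elided.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

# If SPJ kicks in, the physical plan shows no Exchange under the join.
spark.sql("""
    SELECT o.order_id, c.segment
    FROM demo.db.orders o JOIN demo.db.customers c USING (customer_id)
""").explain()
```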

Iceberg on AWS: S3FileIO, Glue Catalog, and Performance Optimization Guide

20 min read
Cazpian Engineering
Platform Engineering Team

If you are running Apache Iceberg on AWS, the single most impactful configuration decision you will make is your choice of FileIO implementation. Most teams start with HadoopFileIO and s3a:// paths because that is what their existing Hadoop-based stack already uses. It works, but it leaves significant performance on the table.

Iceberg's native S3FileIO was built from the ground up for object storage. It uses the AWS SDK v2 directly, skips the Hadoop filesystem abstraction entirely, and implements optimizations that s3a cannot — progressive multipart uploads, native bulk deletes, and zero serialization overhead. Teams that switch typically see faster writes, faster commits, and lower memory usage across the board.

This post covers everything you need to run Iceberg on AWS efficiently: why S3FileIO outperforms s3a, how to configure every critical property, how to avoid S3 throttling, how to set up the Glue catalog correctly, and how to secure your tables with encryption and credential vending.
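
For orientation, here is the shape of the catalog configuration the post builds toward. A sketch, assuming the iceberg-aws-bundle jar and AWS credentials are available (catalog name and bucket are placeholders):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-on-aws")
    .config("spark.sql.catalog.prod", "org.apache.iceberg.spark.SparkCatalog")
    # Glue as the catalog: table metadata lives in the AWS Glue Data Catalog.
    .config("spark.sql.catalog.prod.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    # Native S3 I/O: AWS SDK v2 directly, no Hadoop filesystem layer,
    # no s3a:// paths.
    .config("spark.sql.catalog.prod.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.prod.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)
```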

Iceberg Query Performance Tuning: Partition Pruning, Bloom Filters, and Spark Configs

19 min read
Cazpian Engineering
Platform Engineering Team

Your Iceberg tables are created with the right properties. Your partitions are well-designed. But your queries are still slower than you expected. The dashboard that should load in 3 seconds takes 45. The data scientist's notebook times out. The problem is not your table design — it is that you have not tuned the layers between the query and the data.

Apache Iceberg has a sophisticated query planning pipeline that can skip entire partitions, skip individual files within a partition, and even skip row groups within a file. But each of these layers only works if you configure it correctly. This post walks through every pruning layer, explains exactly how Iceberg uses metadata to skip work, and gives you the Spark configurations to control it all.
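
Before tuning anything, it helps to measure what those layers are doing today. A quick diagnostic sketch, assuming a hypothetical demo.db.events table partitioned by event_date:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog "demo" assumed configured

# File layout per partition: one partition full of tiny files usually
# explains a slow scan better than any Spark config.
spark.sql("""
    SELECT partition,
           count(*)                      AS data_files,
           sum(file_size_in_bytes) / 1e9 AS total_gb
    FROM demo.db.events.files
    GROUP BY partition
    ORDER BY data_files DESC
""").show(truncate=False)

# The physical plan shows which predicates reach Iceberg's scan planning
# as pushed filters, i.e. which pruning layers get a chance to fire.
spark.sql("""
    SELECT * FROM demo.db.events
    WHERE event_date = DATE '2026-01-15' AND user_id = 'abc-123'
""").explain()
```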

Mastering Iceberg File Sizes: How Spark Write Controls and Table Optimization Prevent the Small File Nightmare

13 min read
Cazpian Engineering
Platform Engineering Team

Every data engineer who has worked with Apache Iceberg at scale has hit the same wall: query performance that mysteriously degrades over time. The dashboards that used to load in two seconds now take twenty. The Spark jobs that processed in minutes now crawl for an hour. The root cause, almost always, is the same — thousands of tiny files have silently accumulated in your Iceberg tables.

The small file problem is not unique to Iceberg. But Iceberg gives you an unusually powerful set of tools to prevent it at the write layer and fix it at the maintenance layer. The catch is that most teams never configure these controls properly — or do not even know they exist.
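
The two layers map to two concrete levers. A sketch, assuming a hypothetical demo.db.events table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog "demo" assumed configured

# Write layer: target size for newly written data files (512 MB here,
# which matches Iceberg's default).
spark.sql("""
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'write.target-file-size-bytes' = '536870912'
    )
""")

# Maintenance layer: bin-pack accumulated small files toward the target.
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table   => 'db.events',
        options => map('min-input-files', '5')
    )
""").show()
```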

Why Your Data Platform Runs Two Engines — And Why That's Costing You

11 min read
Cazpian Engineering
Platform Engineering Team

Take an honest look at your data platform architecture. If you are running a lakehouse on AWS, there is a good chance it looks something like this: Spark clusters for ETL and data engineering, plus Trino (or Dremio, or Presto) clusters for analytics and BI queries. Two engines, two teams, two bills — all pointed at the same data.

This dual-runtime pattern has become the default architecture for most modern data platforms. And on the surface, it makes sense. Spark is great at processing data. Trino is great at querying it. Each engine solves a real problem.

But running two engines has hidden costs that most organizations never quantify — and once you add them up, the number is hard to ignore.

Databricks vs. EMR vs. Cazpian: The 2026 Compute Cost Showdown

13 min read
Cazpian Engineering
Platform Engineering Team

"Which platform is cheapest for Spark?" is one of the most common questions data teams ask — and one of the most misleading. The honest answer is: it depends entirely on your workload shape.

A platform that saves you thousands on large nightly batch jobs might quietly waste thousands on your fleet of small ETL runs. The billing model that looks transparent at first glance might hide costs in cold starts, minimum increments, or idle compute you never asked for.

In this post — Part 3 of our compute cost series — we compare Databricks, Amazon EMR, and Cazpian across three realistic workload scenarios. No hypotheticals. Real pricing. Real math.

Zero Cold Starts: How Cazpian Compute Pools Cut Your Spark Bills in Half

11 min read
Cazpian Engineering
Platform Engineering Team

In Part 1 of this series, we exposed the Small Job Tax — the hidden cost of cold starts, overprovisioned clusters, and per-job infrastructure overhead that silently drains data budgets. We showed that for many teams, more than half of their Spark compute spend goes to infrastructure bootstrapping, not actual data processing.

The natural follow-up question: what if you could eliminate that overhead entirely?

That is exactly what Cazpian Compute Pools are built to do.

The Small Job Tax: How Spark Cold Starts Are Silently Draining Your Data Budget

10 min read
Cazpian Engineering
Platform Engineering Team

Most data teams obsess over optimizing their biggest, most complex Spark jobs. Meanwhile, hundreds of tiny ETL jobs — each processing a few gigabytes — quietly rack up a bill that nobody questions.

We call it the Small Job Tax: the disproportionate cost of running lightweight workloads on infrastructure designed for heavy lifting. And for many organizations, it is the single largest source of wasted compute spend.