7 posts tagged with "ETL"

Spark JDBC Data Source: The Complete Optimization Guide for Reads, Writes, and Pushdown

· 43 min read
Cazpian Engineering
Platform Engineering Team

You have a 500 million row table in PostgreSQL. You write spark.read.jdbc(url, "orders", properties) and hit run. Thirty minutes later, the job is still running. One executor is at 100% CPU. The other 49 are idle. Your database server is pegged at a single core, streaming rows through a single JDBC connection while your 50-node Spark cluster sits there doing nothing.

This is the default behavior of Spark JDBC reads. No partitioning. No parallelism. One thread, one connection, one query: SELECT * FROM orders. Every row flows through a single pipe. It is the number one performance mistake data engineers make with Spark JDBC, and it is the default.

This post covers everything you need to know to fix it and to optimize every aspect of Spark JDBC reads and writes. We start with why the default is so slow, then go deep on parallel reads, all pushdown optimizations, fetchSize and batchSize tuning, database-specific configurations, write optimizations, advanced patterns, monitoring and debugging, anti-patterns, and a complete configuration reference.
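The fix described above is a partitioned read: give Spark a numeric column plus bounds, and it issues one range query per partition instead of a single `SELECT * FROM orders`. The sketch below is a simplified, pure-Python version of how Spark derives those per-partition `WHERE` clauses from `lowerBound`, `upperBound`, and `numPartitions` (the table and column names are illustrative, not from the post):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Simplified sketch of how Spark splits a JDBC read into
    num_partitions range predicates -- one query, one connection,
    one task per predicate."""
    stride = (upper - lower) // num_partitions
    preds = []
    for i in range(num_partitions):
        lo = lower + i * stride
        if i == 0:
            # First partition also catches rows below lowerBound (and NULLs).
            preds.append(f"{column} < {lo + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition catches everything at or above its lower edge.
            preds.append(f"{column} >= {lo}")
        else:
            preds.append(f"{column} >= {lo} AND {column} < {lo + stride}")
    return preds

# With Spark itself, the same split is requested via read options, e.g.:
#   spark.read.jdbc(url, "orders", column="order_id",
#                   lowerBound=0, upperBound=500_000_000,
#                   numPartitions=50, properties=properties)
```

Note that the bounds only shape the split points; rows outside `[lowerBound, upperBound)` are still read, by the first and last partitions.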

Iceberg CDC: Patterns, Best Practices, and Real-World Pipelines

· 14 min read
Cazpian Engineering
Platform Engineering Team

You have an operational database — PostgreSQL, MySQL, or DynamoDB — and you need its data in your Iceberg lakehouse. Not a daily snapshot dump. Not a nightly batch export. You need changes replicated continuously so that your analytics, ML models, and dashboards reflect reality within minutes.

This is Change Data Capture (CDC) on Iceberg, and it is one of the most common — and most operationally challenging — data engineering patterns in production today. The ingestion part is straightforward. The hard parts are handling deletes efficiently, keeping read performance from degrading, managing schema changes, and operating the pipeline at scale without it falling over at 3 AM.

This guide covers the two primary CDC architectures (direct materialization and the bronze-silver pattern), table design for CDC workloads, Iceberg's built-in CDC capabilities, compaction strategies, and the operational patterns that keep CDC pipelines healthy in production.
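The core step in the bronze-silver pattern mentioned above is collapsing a raw changelog into current state: keep only the latest event per key, then drop deleted keys. A minimal sketch of that logic, assuming a simple event shape (`key`, `op`, `ts`, `data`) that is illustrative rather than taken from the post:

```python
def apply_changelog(changes):
    """Collapse raw CDC events (bronze) into current state (silver):
    keep the latest event per key by timestamp, then drop keys whose
    latest event is a delete."""
    latest = {}
    for ev in changes:  # ev: {"key", "op", "ts", "data"}
        cur = latest.get(ev["key"])
        if cur is None or ev["ts"] >= cur["ts"]:
            latest[ev["key"]] = ev
    return {k: ev["data"] for k, ev in latest.items() if ev["op"] != "delete"}
```

In production this dedup-then-apply step is typically a `MERGE INTO` against the silver Iceberg table, but the ordering and delete semantics are the same.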

Writing Efficient MERGE INTO Queries on Iceberg with Spark

· 13 min read
Cazpian Engineering
Platform Engineering Team

MERGE INTO is the most powerful and the most misused operation in Apache Iceberg. It handles upserts, conditional deletes, SCD Type-2 updates, and CDC application — all in a single atomic statement. But it is also the operation most likely to trigger a full table scan, blow up your compute costs, and produce thousands of small files if you do not write it carefully.

The difference between a well-written and a poorly written MERGE INTO on the same table can be the difference between 30 seconds and 30 minutes — and between $2 and $200 in compute cost. This post shows you exactly how to write it right.
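One standard way to avoid the full-table-scan trap is to restrict the `ON` clause to the partitions the source batch actually touches, so Iceberg can prune everything else. A hedged sketch that assembles such a statement (the table, key, and partition names are hypothetical):

```python
def build_merge_sql(target, source, key, partition_col, partition_values):
    """Build a MERGE INTO with an explicit partition filter in the ON
    clause, so only the touched partitions of the target are scanned
    instead of the whole table."""
    in_list = ", ".join(f"'{v}'" for v in sorted(partition_values))
    return (
        f"MERGE INTO {target} t USING {source} s\n"
        f"ON t.{key} = s.{key} AND t.{partition_col} IN ({in_list})\n"
        f"WHEN MATCHED THEN UPDATE SET *\n"
        f"WHEN NOT MATCHED THEN INSERT *"
    )
```

The partition value list would normally be computed from the source batch itself (e.g. a `SELECT DISTINCT` over its partition column) before the merge runs.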

Migrating From Hive Tables to Apache Iceberg: The Complete Guide — From On-Prem Hadoop to Cloud Lakehouse

· 24 min read
Cazpian Engineering
Platform Engineering Team

If you are reading this, you probably fall into one of two camps. Either your Hive tables are already on cloud object storage (S3, GCS, ADLS) and you want to convert them to Iceberg format. Or — and this is the harder problem — your Hive tables are sitting on an on-premises Hadoop cluster with HDFS, and you need to move everything to a cloud-based lakehouse with Iceberg.

This guide covers both scenarios. We start with the harder one — migrating from on-prem Hadoop HDFS to a cloud data lake with Iceberg — because that is where most teams get stuck. Then we cover the table format conversion for data already on cloud storage. Both paths converge at the same destination: a modern, open lakehouse built on Apache Iceberg.
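For the second scenario — data already on cloud storage — Iceberg ships Spark procedures for the format conversion: `snapshot` creates an Iceberg table over the existing Hive data without touching it (good for a trial run), while `migrate` converts the table in place. A small sketch that builds the corresponding `CALL` statements (the catalog and table names are illustrative):

```python
def conversion_call(catalog, table, in_place=False, snapshot_name=None):
    """Build the Iceberg Spark procedure call for converting a Hive table:
    'snapshot' leaves the original Hive table untouched; 'migrate'
    replaces it in place with an Iceberg table over the same files."""
    if in_place:
        return f"CALL {catalog}.system.migrate('{table}')"
    return f"CALL {catalog}.system.snapshot('{table}', '{snapshot_name}')"
```

A common rollout is to `snapshot` first, validate queries against the Iceberg copy, then `migrate` once the results match.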

Schema Evolution in Apache Iceberg: The Feature That Saves Data Teams Thousands of Hours

· 10 min read
Cazpian Engineering
Platform Engineering Team

Every data engineer has lived this nightmare: a product team needs a new field in the events table. In a traditional data warehouse, this means a migration ticket, a maintenance window, potentially hours of data rewriting, and a prayer that no downstream pipeline breaks. In a Hive-based data lake, it is even worse — you add the column, but old Parquet files do not have it, partition metadata gets confused, and three different teams spend a week debugging null values.

Apache Iceberg eliminates this entire class of problems. Schema evolution in Iceberg is a metadata-only operation. No data rewrites. No downtime. No broken queries. And the mechanism that makes this possible is both simple and elegant.
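The read-time behavior behind that metadata-only guarantee can be sketched in a few lines: after `ALTER TABLE ... ADD COLUMN`, old data files simply lack the new column, and the reader fills it with null — no file is ever rewritten. (Iceberg actually resolves columns by stable field IDs rather than names; this sketch uses names for brevity.)

```python
def read_with_schema(rows, schema):
    """Sketch of Iceberg's read path after ADD COLUMN: project every
    row onto the current schema, substituting null for any column the
    old data file does not contain."""
    return [{col: row.get(col) for col in schema} for row in rows]
```

So a file written before a `referrer` column existed still reads cleanly under the new schema, with `referrer` as null — which is exactly why the add is instant.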

Mastering Iceberg File Sizes: How Spark Write Controls and Table Optimization Prevent the Small File Nightmare

· 13 min read
Cazpian Engineering
Platform Engineering Team

Every data engineer who has worked with Apache Iceberg at scale has hit the same wall: query performance that mysteriously degrades over time. The dashboards that used to load in two seconds now take twenty. The Spark jobs that processed in minutes now crawl for an hour. The root cause, almost always, is the same — thousands of tiny files have silently accumulated in your Iceberg tables.

The small file problem is not unique to Iceberg. But Iceberg gives you an unusually powerful set of tools to prevent it at the write layer and fix it at the maintenance layer. The catch is that most teams never configure these controls properly — or do not even know they exist.
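The central write-layer control is the table's target file size (Iceberg's `write.target-file-size-bytes`, which defaults to 512 MB). A quick sanity check, sketched below, is to compare the file count a partition *should* have at that target against what the metadata actually reports — a count far above the expectation is the small-file signal:

```python
TARGET_FILE_SIZE = 512 * 1024 * 1024  # write.target-file-size-bytes default

def expected_files(partition_bytes, target=TARGET_FILE_SIZE):
    """Roughly how many data files a properly sized write of this many
    bytes should produce; ceil division so any remainder still needs
    one more file."""
    return max(1, -(-partition_bytes // target))
```

For example, a 10 GiB partition should land around 20 files; if Iceberg's `files` metadata table shows 2,000, the write path needs tuning before any compaction job will keep up.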

The Small Job Tax: How Spark Cold Starts Are Silently Draining Your Data Budget

· 10 min read
Cazpian Engineering
Platform Engineering Team

Most data teams obsess over optimizing their biggest, most complex Spark jobs. Meanwhile, hundreds of tiny ETL jobs — each processing a few gigabytes — quietly rack up a bill that nobody questions.

We call it the Small Job Tax: the disproportionate cost of running lightweight workloads on infrastructure designed for heavy lifting. And for many organizations, it is the single largest source of wasted compute spend.