23 posts tagged with "Lakehouse"

Migrating From Hive Tables to Apache Iceberg: The Complete Guide — From On-Prem Hadoop to Cloud Lakehouse

· 24 min read
Cazpian Engineering
Platform Engineering Team

If you are reading this, you probably fall into one of two camps. Either your Hive tables are already on cloud object storage (S3, GCS, ADLS) and you want to convert them to Iceberg format. Or — and this is the harder problem — your Hive tables are sitting on an on-premises Hadoop cluster with HDFS, and you need to move everything to a cloud-based lakehouse with Iceberg.

This guide covers both scenarios. We start with the harder one — migrating from on-prem Hadoop HDFS to a cloud data lake with Iceberg — because that is where most teams get stuck. Then we cover the table format conversion for data already on cloud storage. Both paths converge at the same destination: a modern, open lakehouse built on Apache Iceberg.

Time Travel in Apache Iceberg: Beyond the Basics — Auditing, Debugging, and ML Reproducibility

· 12 min read
Cazpian Engineering
Platform Engineering Team

Every Apache Iceberg overview mentions time travel. "Query your data as it existed at any point in time." It sounds impressive, gets a mention in the feature list, and then most teams never use it beyond the occasional ad-hoc debugging query.

That is a missed opportunity. Iceberg's snapshot system is not just a convenience feature — it is a production-grade capability that can replace custom auditing infrastructure, eliminate data recovery anxiety, and solve one of machine learning's hardest problems: dataset reproducibility.

This post goes beyond the basics. We will cover the snapshot architecture, the practical query patterns, branching and tagging, the Write-Audit-Publish pattern, and real-world use cases that make time travel indispensable.
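As a rough illustration of the snapshot idea, here is a minimal toy sketch in Python. It is not Iceberg's actual API: the `snapshot_log` list, the snapshot IDs, and the `snapshot_as_of` helper are all hypothetical. It only shows the lookup rule behind an "as of timestamp" query, which resolves to the snapshot that was current at that moment.

```python
from bisect import bisect_right

# Hypothetical snapshot log: (commit timestamp in ms, snapshot id), oldest first.
snapshot_log = [(1000, "s1"), (2000, "s2"), (3000, "s3")]

def snapshot_as_of(log, ts_ms):
    """Return the id of the latest snapshot committed at or before ts_ms."""
    timestamps = [ts for ts, _ in log]
    idx = bisect_right(timestamps, ts_ms)
    if idx == 0:
        raise ValueError("no snapshot exists at or before the requested timestamp")
    return log[idx - 1][1]

# A query "as of" t=2500 reads the table state captured by snapshot s2.
current_at_2500 = snapshot_as_of(snapshot_log, 2500)
```

Because every commit adds a new snapshot rather than overwriting the previous one, older table states stay addressable until their snapshots are expired.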

Schema Evolution in Apache Iceberg: The Feature That Saves Data Teams Thousands of Hours

· 10 min read
Cazpian Engineering
Platform Engineering Team

Every data engineer has lived this nightmare: a product team needs a new field in the events table. In a traditional data warehouse, this means a migration ticket, a maintenance window, potentially hours of data rewriting, and a prayer that no downstream pipeline breaks. In a Hive-based data lake, it is even worse — you add the column, but old Parquet files do not have it, partition metadata gets confused, and three different teams spend a week debugging null values.

Apache Iceberg eliminates this entire class of problems. Schema evolution in Iceberg is a metadata-only operation. No data rewrites. No downtime. No broken queries. And the mechanism that makes this possible is both simple and elegant.
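To make the metadata-only claim concrete, here is a minimal toy sketch in Python, assuming the mechanism in question is Iceberg's use of stable field IDs to track columns. The schemas, the `read_rows` helper, and the sample rows are hypothetical and do not call any Iceberg library. Old data files are left untouched, and the reader maps their values onto the current schema by ID.

```python
# Toy model: each column is tracked by a stable field ID, not by name or position.
old_schema = {1: "user_id", 2: "event_type"}
# Evolved schema: field 2 was renamed and field 3 was added. Only metadata changed.
new_schema = {1: "user_id", 2: "event_name", 3: "device"}

# A data file written under the old schema; values are keyed by field ID.
old_file_rows = [{1: "u1", 2: "click"}, {1: "u2", 2: "view"}]

def read_rows(file_rows, table_schema):
    """Project file rows onto the current table schema by field ID.

    Fields that are absent from the file (added after it was written) read as None.
    """
    return [
        {name: row.get(field_id) for field_id, name in table_schema.items()}
        for row in file_rows
    ]

rows = read_rows(old_file_rows, new_schema)
```

In this sketch the renamed column still resolves to the old values and the added column reads as null for old files, with no data rewritten.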

Apache Polaris: How Policy-Managed Table Maintenance Eliminates Iceberg Operational Overhead

· 12 min read
Cazpian Engineering
Platform Engineering Team

In our previous post, we covered how to control Iceberg file sizes at write time and how to fix small file problems with Iceberg's table maintenance procedures. The conclusion was clear: the tools are excellent, but manually scheduling and managing maintenance across dozens or hundreds of tables does not scale.

This post is about the layer that solves that problem: Apache Polaris — the open-source Iceberg catalog that introduces policy-based table maintenance, letting you define optimization rules once and have them applied automatically across your entire lakehouse.

Mastering Iceberg File Sizes: How Spark Write Controls and Table Optimization Prevent the Small File Nightmare

· 13 min read
Cazpian Engineering
Platform Engineering Team

Every data engineer who has worked with Apache Iceberg at scale has hit the same wall: query performance that mysteriously degrades over time. Dashboards that used to load in two seconds now take twenty. Spark jobs that used to finish in minutes now crawl for an hour. The root cause is almost always the same: thousands of tiny files have silently accumulated in your Iceberg tables.

The small file problem is not unique to Iceberg. But Iceberg gives you an unusually powerful set of tools to prevent it at the write layer and fix it at the maintenance layer. The catch is that most teams never configure these controls properly — or do not even know they exist.

Why Every Data Company Is Betting on Apache Iceberg — And What It Means for AI

· 13 min read
Cazpian Engineering
Platform Engineering Team

Something unusual is happening in the data industry. Companies that have spent years — and billions of dollars — building proprietary storage formats are now rallying behind an open-source table format created at Netflix. Snowflake, Databricks, Dremio, Starburst, Teradata, Google BigQuery, AWS — the list keeps growing. They are not just adding Iceberg as a checkbox feature. They are making it central to their platform strategy.

If you are a data engineer, you have almost certainly heard of Apache Iceberg by now. But the more interesting question is not what Iceberg is — it is why every major vendor has decided that their own proprietary format is no longer enough.

One Engine, Two Access Paths: How Arrow Flight SQL Makes a Single-Engine Lakehouse Possible

· 14 min read
Cazpian Engineering
Platform Engineering Team

In our previous post, we broke down the five hidden costs of running two compute engines in your lakehouse — the infrastructure duplication, the cost opacity, the metadata sync bugs, the skills fragmentation, and the governance headaches. We showed that this dual-engine tax can run $40,000+ per year for a mid-size data team.

The obvious question: why not just use Spark for everything?

The honest answer has always been: because Spark cannot deliver query results to BI tools fast enough. Not because Spark cannot execute the query — it usually can — but because the last mile of data delivery through traditional JDBC/ODBC protocols is painfully slow.

Arrow Flight SQL eliminates that bottleneck. And with it, the primary architectural reason for running a second query engine disappears.

Why Your Data Platform Runs Two Engines — And Why That's Costing You

· 11 min read
Cazpian Engineering
Platform Engineering Team

Take an honest look at your data platform architecture. If you are running a lakehouse on AWS, there is a good chance it looks something like this: Spark clusters for ETL and data engineering, plus Trino (or Dremio, or Presto) clusters for analytics and BI queries. Two engines, two teams, two bills — all pointed at the same data.

This dual-runtime pattern has become the default architecture for most modern data platforms. And on the surface, it makes sense. Spark is great at processing data. Trino is great at querying it. Each engine solves a real problem.

But running two engines has hidden costs that most organizations never quantify — and once you add them up, the number is hard to ignore.

Databricks vs. EMR vs. Cazpian: The 2026 Compute Cost Showdown

· 13 min read
Cazpian Engineering
Platform Engineering Team

"Which platform is cheapest for Spark?" is one of the most common questions data teams ask — and one of the most misleading. The honest answer is: it depends entirely on your workload shape.

A platform that saves you thousands on large nightly batch jobs might quietly waste thousands on your fleet of small ETL runs. The billing model that looks transparent at first glance might hide costs in cold starts, minimum increments, or idle compute you never asked for.

In this post — Part 3 of our compute cost series — we compare Databricks, Amazon EMR, and Cazpian across three realistic workload scenarios. No hypotheticals. Real pricing. Real math.

Zero Cold Starts: How Cazpian Compute Pools Cut Your Spark Bills in Half

· 11 min read
Cazpian Engineering
Platform Engineering Team

In Part 1 of this series, we exposed the Small Job Tax — the hidden cost of cold starts, overprovisioned clusters, and per-job infrastructure overhead that silently drains data budgets. We showed that for many teams, more than half of their Spark compute spend goes to infrastructure bootstrapping, not actual data processing.

The natural follow-up question: what if you could eliminate that overhead entirely?

That is exactly what Cazpian Compute Pools are built to do.