Why Every Data Company Is Betting on Apache Iceberg — And What It Means for AI
Something unusual is happening in the data industry. Companies that have spent years — and billions of dollars — building proprietary storage formats are now rallying behind an open-source table format created at Netflix. Snowflake, Databricks, Dremio, Starburst, Teradata, Google BigQuery, AWS — the list keeps growing. They are not just adding Iceberg as a checkbox feature. They are making it central to their platform strategy.
If you are a data engineer, you have almost certainly heard of Apache Iceberg by now. But the more interesting question is not what Iceberg is — it is why every major vendor has decided that their own proprietary format is no longer enough.
The Problem That Created Iceberg: Vendor Lock-In at the Storage Layer
For the past decade, the data industry operated on a simple playbook: build a great compute engine, tie it to a proprietary storage format, and make it painful for customers to leave. It worked.
- Snowflake stored data in its own internal micro-partition format. You could load data in, but getting it out at scale meant costly exports.
- Databricks built Delta Lake — open-source in name, but tightly coupled to the Databricks runtime and its proprietary features like Photon and Unity Catalog.
- Teradata ran on its own storage engine for decades, purpose-built for its optimizer.
- Google BigQuery used Capacitor, a proprietary columnar format invisible to anything outside BigQuery.
Each vendor had valid technical reasons for their format. Proprietary storage allowed deep integration with their query optimizer, better compression, and tighter performance tuning. But it came at a cost that customers increasingly refused to pay: your data was trapped.
Want to run Spark on your Snowflake data? Export it first. Want to query your Delta tables from Trino without going through Databricks? Good luck with full compatibility. Want to switch vendors entirely? Plan a multi-month migration project.
The data lakehouse was supposed to fix this. Store everything in open formats on cloud object storage — Parquet files on S3 — and query it with any engine. But raw Parquet on S3 was not enough. Without ACID transactions, schema enforcement, time travel, or partition evolution, the "open data lake" was more of a dumping ground than a reliable analytics platform.
That is the gap Apache Iceberg fills.
What Makes Iceberg Fundamentally Different
Iceberg is not a compute engine. It is not a database. It is a table format specification — a set of rules for how to organize data files, metadata, and snapshots on cloud object storage so that any engine can reliably read and write the same tables.
Here is what that means in practice:
Engine Independence
An Iceberg table stored on S3 can be read by Spark, Flink, Trino, Dremio, Snowflake, BigQuery, and Athena — all without copying or converting the data. The table format is the contract. Any engine that speaks the Iceberg protocol can participate.
This is fundamentally different from Delta Lake, which, despite being open-source, was designed around Spark's transaction log format and requires the Delta Lake library (or compatibility layers like UniForm) for full feature support.
Hidden Partitioning
In traditional Hive-style partitioning, the partition scheme leaks into every query. Change your partition column, and every downstream pipeline breaks. Iceberg decouples the physical partition layout from the logical query interface. You can evolve partitions — switching from daily to hourly partitioning, for example — without rewriting data or breaking existing queries.
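As a sketch of what this looks like in practice — using Spark SQL with the Iceberg SQL extensions enabled, and a hypothetical `db.events` table — partition evolution is a metadata-only DDL change:

```sql
-- Hypothetical table, initially partitioned by day
CREATE TABLE db.events (
  event_id BIGINT,
  event_ts TIMESTAMP,
  payload  STRING
) USING iceberg
PARTITIONED BY (days(event_ts));

-- Later: switch to hourly partitioning without rewriting any data.
-- Existing files keep their daily layout; new writes use hours.
ALTER TABLE db.events ADD PARTITION FIELD hours(event_ts);
ALTER TABLE db.events DROP PARTITION FIELD days(event_ts);
```

Queries keep filtering on `event_ts` directly; Iceberg maps the predicate onto both the old and new layouts, so nothing downstream has to change.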
Schema Evolution Without Downtime
Add a column, rename a column, widen a type — Iceberg handles it through metadata updates, not data rewrites. Every schema change is tracked, versioned, and backward-compatible. Your existing data stays untouched. Your new data conforms to the new schema. Both coexist seamlessly.
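For illustration, here is what those three changes look like in Spark SQL against a hypothetical `db.events` table (column names are invented). Each statement is a metadata update only:

```sql
-- Add a column: existing rows simply read it as NULL, no rewrite
ALTER TABLE db.events ADD COLUMN country STRING;

-- Rename safely: Iceberg tracks columns by ID, not by name,
-- so old data files still resolve correctly
ALTER TABLE db.events RENAME COLUMN payload TO body;

-- Widen a type (assuming event_id was created as INT;
-- int -> bigint is an allowed promotion in the spec)
ALTER TABLE db.events ALTER COLUMN event_id TYPE BIGINT;
```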
Time Travel and Snapshot Isolation
Every write to an Iceberg table creates an immutable snapshot. You can query the table as it existed at any point in time. This is not just convenient — it is critical for auditing, debugging, and, as we will discuss later, machine learning reproducibility.
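In engines that support the Spark SQL time-travel syntax, querying a past state is a one-line change (table name and snapshot ID below are placeholders):

```sql
-- Query the table as it stood at a wall-clock time
SELECT count(*) FROM db.events TIMESTAMP AS OF '2025-06-01 00:00:00';

-- Or pin an exact snapshot ID taken from the table's history metadata
SELECT count(*) FROM db.events VERSION AS OF 8744736658442914487;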
ACID Transactions on Object Storage
Iceberg implements optimistic concurrency control: each commit writes a new metadata file and then atomically swaps the table's current-metadata pointer through the catalog. That gives you serializable isolation on object stores like S3, which offer no multi-file transactional primitives of their own. Multiple writers can safely operate on the same table; conflicting commits are detected and retried rather than corrupting the data.
The Vendor Convergence: Who Is Adopting Iceberg and Why
Snowflake: From Proprietary to Iceberg-Native
Snowflake's move is perhaps the most telling. For years, Snowflake's internal storage format was one of its biggest competitive advantages — and one of the strongest lock-in mechanisms in the industry. Customers loaded data into Snowflake, and it stayed there.
In 2024, Snowflake announced native Iceberg table support, allowing customers to store and query Iceberg tables directly on their own cloud storage with full Snowflake query performance. This was not a minor integration — it was an acknowledgment that customers demand data portability, and fighting that demand is a losing strategy.
Why Snowflake did it: Customer retention through value, not lock-in. If your data is already in Iceberg format on S3, the barrier to trying Snowflake drops to near zero — and Snowflake bets that its query performance and ecosystem will keep you.
Databricks: The $1 Billion Bet
In June 2024, Databricks acquired Tabular — the company founded by Ryan Blue, Daniel Weeks, and Jason Reid, with Blue and Weeks being the original creators of Apache Iceberg at Netflix — for over $1 billion. For a roughly 40-person startup, that is an extraordinary valuation. It signals exactly how critical Iceberg has become.
Databricks had spent years building Delta Lake as their open table format. But the market was speaking clearly: Iceberg was gaining faster adoption, broader engine support, and more community momentum. Rather than fight the tide, Databricks decided to own both sides.
The result is Delta Lake UniForm, which automatically generates Iceberg metadata for Delta tables — allowing Iceberg-compatible engines to read Delta data without conversion. The long-term goal, according to Databricks, is to evolve toward a single, unified open standard.
Why Databricks did it: Defensive and offensive. Defensively, they could not afford to let a competitor acquire the Iceberg creators. Offensively, owning both Delta and Iceberg expertise gives them a path to becoming the neutral standard — and the default platform for whatever that standard becomes.
Dremio: Built on Iceberg From Day One
Dremio's co-founder Tomer Shiran has been one of the most vocal advocates for Apache Iceberg. Dremio was arguably the first commercial platform to build its entire lakehouse experience around Iceberg as the primary table format — not as an afterthought or compatibility layer.
Dremio's approach treats Iceberg as the storage standard and focuses its value-add on the query engine (based on Apache Arrow) and the semantic layer. Your data stays in Iceberg on your object storage. Dremio never copies it into a proprietary format.
Why Dremio did it: Differentiation through openness. In a market where Snowflake and Databricks were competing on lock-in, Dremio bet that a fully open approach would win customers who had been burned by proprietary formats before.
Starburst: Federated Queries Across Iceberg
Starburst, the commercial company behind Trino (formerly PrestoSQL), added deep Iceberg integration as part of its federated query strategy. Starburst's value proposition is querying data wherever it lives — and Iceberg's engine-independent format fits that story perfectly.
With Starburst, you can run federated queries across Iceberg tables on S3, PostgreSQL databases, Kafka streams, and MongoDB collections — all in a single SQL query. Iceberg is the preferred format for the analytical layer.
Why Starburst did it: Iceberg is the natural fit for a federated query engine. When your entire strategy is "query anything, anywhere," you need a storage format that does not care which engine writes the data.
Teradata: The Enterprise Giant Embraces Open
Teradata's adoption of Iceberg is significant because of what Teradata represents: the old guard of enterprise data warehousing. For decades, Teradata ran on proprietary storage tightly coupled to proprietary hardware. Their move to support Iceberg on cloud object storage is a recognition that the market has shifted permanently toward open formats.
Teradata now supports querying and writing Iceberg tables as part of its VantageCloud platform, allowing customers to keep their analytics on Teradata's optimizer while storing data in an open, portable format.
Why Teradata did it: Survival and relevance. Enterprise customers are modernizing, and no CTO wants to sign a new multi-year deal with a vendor that does not support open table formats. Iceberg support is table stakes for the enterprise market in 2026.
The Real Benefits Over Proprietary Formats
Let us move beyond the marketing narratives and look at what Iceberg actually delivers compared to proprietary alternatives:
| Capability | Proprietary Formats | Apache Iceberg |
|---|---|---|
| Engine compatibility | Single vendor (or limited) | Spark, Flink, Trino, Dremio, Snowflake, BigQuery, Athena, and more |
| Data portability | Export required to switch vendors | Data stays in place, switch engines freely |
| Partition evolution | Requires data rewrite | Metadata-only change, no data rewrite |
| Schema evolution | Varies, often limited | Full support (add, rename, widen, reorder columns) |
| Time travel | Vendor-specific implementation | Built into the spec, works across all engines |
| Community governance | Controlled by single vendor | Apache Software Foundation, vendor-neutral |
| Multi-engine writes | Generally not supported | Supported with ACID guarantees |
| Cloud storage | Often vendor-managed | Your storage, your account, your control |
The most important row in that table is multi-engine writes. In a proprietary world, the vendor that writes the data controls access to it. With Iceberg, your Spark ETL pipelines, your Flink streaming jobs, and your Trino ad-hoc queries can all read and write the same tables — safely, with full transactional guarantees.
This is not just a technical convenience. It is a fundamental shift in data architecture economics. When your storage format is vendor-neutral, vendors have to compete on compute performance, developer experience, and ecosystem — not on how difficult it is to leave.
How Iceberg Enables the Next Generation of AI Workloads
Here is where the story gets even more interesting. The same properties that make Iceberg great for analytics are turning it into a foundational layer for AI and machine learning workloads. This is not theoretical — companies like Apple, Netflix, and LinkedIn are already running AI workloads on Iceberg.
Reproducible Training Datasets with Time Travel
Machine learning is only as good as its training data. When a model starts producing unexpected results, the first question is always: "What data did we train on?"
With Iceberg's snapshot isolation, you can tag the exact state of your training dataset at the moment you trained a model. Six months later, you can query that exact snapshot — same rows, same schema, same values — and reproduce the training run. No custom versioning system needed. No copying data to a separate "training archive." The time travel capability is built into the table format itself.
```sql
-- Tag the training dataset snapshot
ALTER TABLE ml.training_features CREATE TAG v2_model_training;

-- Six months later, reproduce the exact training data
SELECT * FROM ml.training_features VERSION AS OF 'v2_model_training';
```
Feature Stores on Open Storage
Feature stores are critical infrastructure for ML teams — they manage, store, and serve the features that models consume. Most feature store implementations today use proprietary backends (DynamoDB, Redis, vendor-specific stores) that create yet another data silo.
Iceberg tables are a natural fit for the offline feature store layer:
- Schema evolution accommodates new features without breaking existing pipelines
- Partition evolution lets you optimize feature retrieval patterns as usage changes
- Time travel provides point-in-time correct feature values for training
- Engine independence means your feature engineering in Spark and your model training in PyTorch can both access the same feature tables directly
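As a sketch of the point-in-time property — assuming a hypothetical `ml.customer_features` table that is snapshotted after each feature-engineering run — training reads pin the state that was current when the labels were cut, not the state today:

```sql
-- Point-in-time correct features for a training run:
-- read the offline feature table as it stood when labels were generated
SELECT customer_id, avg_order_value, days_since_last_order
FROM ml.customer_features TIMESTAMP AS OF '2025-03-15 00:00:00';
```

This is the same time-travel mechanism used for audits, applied to avoid training-serving skew: no separate "training archive" copies required.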
Data Quality for LLM Fine-Tuning
Large language model fine-tuning requires massive, curated datasets with strict quality controls. Iceberg's ACID transactions ensure that your fine-tuning datasets are always in a consistent state — even when multiple teams are concurrently curating, cleaning, and enriching the data.
Schema evolution lets you add metadata columns (quality scores, source annotations, toxicity flags) without disrupting existing pipelines. And if a bad data batch corrupts your fine-tuning set, you can roll back to the previous snapshot instantly.
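Rolling back is a single catalog operation. Iceberg ships a Spark stored procedure for this (catalog and table names below are hypothetical; the snapshot ID comes from the table's history metadata):

```sql
-- Inspect recent snapshots to find the last known-good state
SELECT snapshot_id, committed_at
FROM my_catalog.ml.finetune_data.snapshots
ORDER BY committed_at DESC;

-- Roll the table back to that snapshot; no data files are rewritten,
-- the table pointer simply moves to the earlier snapshot
CALL my_catalog.system.rollback_to_snapshot('ml.finetune_data', 1234567890123456789);
```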
Agentic AI and Dynamic Query Patterns
As AI moves toward agentic systems — autonomous agents that query, analyze, and act on data — the query patterns become unpredictable. Unlike traditional BI dashboards with known access patterns, AI agents generate ad-hoc queries that span datasets in ways no one anticipated.
Iceberg's metadata layer — with its manifest files, column-level statistics, and partition pruning — enables these dynamic queries to run efficiently without requiring a DBA to pre-optimize the storage layout for every possible access pattern. The format is self-describing and self-optimizing in ways that raw Parquet on S3 never could be.
The Open Table Format War Is Over
Two years ago, the data industry was split into three camps: Delta Lake, Apache Iceberg, and Apache Hudi. Each had its advocates, its strengths, and its production deployments.
That war is effectively over. Here is why:
- Databricks acquired Tabular and is building interoperability between Delta and Iceberg through UniForm — tacitly acknowledging that Iceberg compatibility is non-negotiable.
- Snowflake went Iceberg-native — the biggest independent data warehouse chose Iceberg over building their own open format.
- AWS, Google Cloud, and Azure all added native Iceberg support in their managed analytics services.
- The Apache Software Foundation governance means no single vendor controls Iceberg's roadmap — unlike Delta Lake, which sits under the Linux Foundation but remains Databricks-dominated. (Hudi is also an ASF project, but its contributor base has stayed concentrated around its original creators at Uber and their successor companies.)
This does not mean Delta Lake or Hudi will disappear. They will continue to serve their existing user bases. But for new projects and platform bets, Iceberg has become the default choice — the format you adopt unless you have a specific reason not to.
What This Means for Data Engineers
If you are building or operating a data platform today, here is the practical takeaway:
Start with Iceberg. If you are choosing a table format for a new project, choose Iceberg. The engine compatibility, community momentum, and vendor support make it the lowest-risk, highest-optionality choice.
Plan your migration. If you are on Delta Lake or Hudi, start evaluating the migration path. Databricks' UniForm and tools like Apache XTable (formerly OneTable) can provide interoperability bridges while you plan a full migration.
Rethink vendor negotiations. When your data is in Iceberg on your own object storage, your negotiating position with compute vendors changes fundamentally. You are no longer locked in. Use that leverage.
Think about AI from day one. Structure your Iceberg tables with ML workloads in mind — use time travel for dataset versioning, design feature tables with schema evolution in mind, and tag snapshots for model training reproducibility.
The data industry is converging on a single open table format. That has not happened before. And it changes everything about how data platforms are built, priced, and evaluated.
Building a lakehouse on Apache Iceberg? Cazpian provides a fully managed Spark platform with native Iceberg support, zero cold starts, and usage-based pricing — all running in your AWS account. Learn more about our architecture.