Skip to main content

4 posts tagged with "AI"

View All Tags

How Apache Iceberg Makes Your Data AI-Ready: Feature Stores, Training Pipelines, and Agentic AI

· 12 min read
Cazpian Engineering
Platform Engineering Team

How Apache Iceberg Makes Your Data AI-Ready

Every AI project starts with the same bottleneck: data. Not the volume of data — most organizations have plenty of that. The bottleneck is data quality, data versioning, and data reproducibility. Can you guarantee that the dataset you trained on last month has not changed? Can you trace exactly which features went into a model prediction? Can you roll back a corrupted training set in minutes instead of days?

These are data engineering problems, not machine learning problems. And Apache Iceberg — originally built for large-scale analytics — turns out to solve them remarkably well.

This post covers four concrete patterns for using Iceberg as the data foundation for AI workloads: feature stores, training data versioning, LLM fine-tuning pipelines, and agentic AI data access.

Time Travel in Apache Iceberg: Beyond the Basics — Auditing, Debugging, and ML Reproducibility

· 12 min read
Cazpian Engineering
Platform Engineering Team

Time Travel in Apache Iceberg: Beyond the Basics

Every Apache Iceberg overview mentions time travel. "Query your data as it existed at any point in time." It sounds impressive, gets a mention in the feature list, and then most teams never use it beyond the occasional ad-hoc debugging query.

That is a missed opportunity. Iceberg's snapshot system is not just a convenience feature — it is a production-grade capability that can replace custom auditing infrastructure, eliminate data recovery anxiety, and solve one of machine learning's hardest problems: dataset reproducibility.

This post goes beyond the basics. We will cover the snapshot architecture, the practical query patterns, branching and tagging, the Write-Audit-Publish pattern, and real-world use cases that make time travel indispensable.

Why Every Data Company Is Betting on Apache Iceberg — And What It Means for AI

· 13 min read
Cazpian Engineering
Platform Engineering Team

Why Every Data Company Is Betting on Apache Iceberg

Something unusual is happening in the data industry. Companies that have spent years — and billions of dollars — building proprietary storage formats are now rallying behind an open-source table format created at Netflix. Snowflake, Databricks, Dremio, Starburst, Teradata, Google BigQuery, AWS — the list keeps growing. They are not just adding Iceberg as a checkbox feature. They are making it central to their platform strategy.

If you are a data engineer, you have almost certainly heard of Apache Iceberg by now. But the more interesting question is not what Iceberg is — it is why every major vendor has decided that their own proprietary format is no longer enough.