
How Apache Iceberg Makes Your Data AI-Ready: Feature Stores, Training Pipelines, and Agentic AI

· 12 min read
Cazpian Engineering
Platform Engineering Team


Every AI project starts with the same bottleneck: data. Not the volume of data — most organizations have plenty of that. The bottleneck is data quality, data versioning, and data reproducibility. Can you guarantee that the dataset you trained on last month has not changed? Can you trace exactly which features went into a model prediction? Can you roll back a corrupted training set in minutes instead of days?

These are data engineering problems, not machine learning problems. And Apache Iceberg — originally built for large-scale analytics — turns out to solve them remarkably well.

This post covers four concrete patterns for using Iceberg as the data foundation for AI workloads: feature stores, training data versioning, LLM fine-tuning pipelines, and agentic AI data access.

Why Traditional Data Platforms Fail AI Workloads

Before diving into solutions, let us understand why existing data infrastructure struggles with ML:

[Diagram: how Iceberg enables AI workloads — feature stores, training data versioning with snapshots and tags, LLM fine-tuning pipelines, and agentic AI data access patterns]

No dataset versioning. When your training data is a Hive table on S3, the table is mutable. Every ETL run overwrites or appends data. Once you train a model, the exact dataset is gone — overwritten by tomorrow's pipeline. Reproducing a training run from three months ago is either impossible or requires custom versioning infrastructure.

Feature inconsistency. Feature engineering often happens in notebooks, one-off scripts, or team-specific pipelines. The same feature computed differently by two teams leads to model inconsistencies. Feature stores solve this, but most implementations require yet another proprietary backend.

Schema drift breaks pipelines. ML pipelines are brittle. A schema change upstream — a renamed column, a changed type, a new field — can silently corrupt feature computation without any error. By the time someone notices, the model has been retrained on bad features.

No transactional writes. When multiple teams contribute to training datasets concurrently — data curation, labeling, quality scoring — without ACID guarantees, the dataset can end up in an inconsistent state. Half-written batches, duplicate records, or missing labels are common.

Iceberg addresses every one of these problems through capabilities it already has — not through AI-specific features bolted on, but through the core table format design.

Pattern 1: Feature Stores on Iceberg

A feature store is the central repository where ML teams store, manage, and serve the computed features that models consume. The typical architecture has two layers:

  • Offline store: Historical feature values for training — needs to be queryable, versioned, and efficient for large scans
  • Online store: Latest feature values for inference — needs sub-millisecond latency

Iceberg is a natural fit for the offline store layer. Here is why and how.

Why Iceberg for Offline Feature Storage

| Requirement | Iceberg Capability |
| --- | --- |
| Feature versioning | Snapshots + tags = every feature table state is preserved |
| Schema evolution | Add new features without breaking existing pipelines |
| Point-in-time lookups | Time travel returns features as of any timestamp |
| Large-scale scans for training | Columnar Parquet + partition pruning = fast batch reads |
| Multi-team writes | ACID transactions prevent concurrent write conflicts |
| Engine independence | Spark for feature engineering, PyIceberg for lightweight access |
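The point-in-time requirement deserves a concrete illustration: when assembling a training set, each label must be joined to the feature values that were current *at label time*, never to later ones, or the model trains on leaked future information. Here is a minimal sketch of that join using pandas `merge_asof` on toy data — in practice both frames would come from Iceberg scans (current table and a time-travel read), and all column names here are illustrative:

```python
import pandas as pd

# Feature snapshots written by the daily pipeline (one row per user per run)
features = pd.DataFrame({
    "user_id": ["u1", "u2", "u1"],
    "feature_timestamp": pd.to_datetime(["2026-01-01", "2026-01-03", "2026-01-05"]),
    "total_orders_30d": [3, 1, 5],
})

# Training labels, each observed at some point in time
labels = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "label_timestamp": pd.to_datetime(["2026-01-04", "2026-01-10"]),
    "churned": [0, 1],
})

# Point-in-time join: for each label, take the latest feature row at or
# before the label's timestamp — later feature rows would be leakage
features = features.sort_values("feature_timestamp")
labels = labels.sort_values("label_timestamp")
training_set = pd.merge_asof(
    labels, features,
    left_on="label_timestamp", right_on="feature_timestamp",
    by="user_id", direction="backward",
)
print(training_set[["user_id", "total_orders_30d", "churned"]])
```

Note that u1's label on 2026-01-04 picks up the 2026-01-01 feature row (value 3), not the later 2026-01-05 one — exactly the "as of" semantics the offline store must provide.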

Building a Feature Table

-- Create a user features table
CREATE TABLE catalog.ml.user_features (
  user_id STRING,
  feature_timestamp TIMESTAMP,
  -- Behavioral features
  total_orders_30d INT,
  avg_order_value DOUBLE,
  days_since_last_order INT,
  -- Engagement features
  page_views_7d INT,
  session_duration_avg DOUBLE,
  -- Demographic features
  account_age_days INT,
  preferred_category STRING
) USING iceberg
PARTITIONED BY (day(feature_timestamp))
TBLPROPERTIES (
  'write.distribution-mode' = 'hash',
  'write.target-file-size-bytes' = '268435456'
);

Feature Engineering Pipeline

-- Daily feature computation (runs as a Spark job)
INSERT INTO catalog.ml.user_features
SELECT
  u.user_id,
  current_timestamp() AS feature_timestamp,
  -- Behavioral
  COUNT(o.order_id) AS total_orders_30d,
  AVG(o.total_amount) AS avg_order_value,
  DATEDIFF(current_date(), MAX(o.order_date)) AS days_since_last_order,
  -- Engagement
  SUM(e.page_views) AS page_views_7d,
  AVG(e.session_duration) AS session_duration_avg,
  -- Demographic
  DATEDIFF(current_date(), u.created_at) AS account_age_days,
  u.preferred_category
FROM catalog.db.users u
LEFT JOIN catalog.db.orders o
  ON u.user_id = o.user_id
  AND o.order_date >= date_sub(current_date(), 30)
LEFT JOIN catalog.db.engagement e
  ON u.user_id = e.user_id
  AND e.event_date >= date_sub(current_date(), 7)
GROUP BY u.user_id, u.created_at, u.preferred_category;

Adding New Features Without Breaking Anything

When the data science team needs a new feature, schema evolution makes it seamless:

-- Add a new feature column
ALTER TABLE catalog.ml.user_features
  ADD COLUMN cart_abandonment_rate DOUBLE
  COMMENT 'Added 2026-03. Null for feature rows before March 2026.';

-- The next pipeline run computes the new feature alongside existing ones
-- Old feature rows return null for the new column — expected and handled by ML code

No rewrite. No downtime. No coordination with other teams consuming the feature table.
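The one thing downstream code must do is tolerate those nulls. A minimal sketch of one common handling strategy — impute a neutral value and keep an explicit missing-indicator column so the model can distinguish "zero" from "not yet computed" (pandas, toy data; column names follow the example above):

```python
import pandas as pd

# Older feature rows predate the new column and come back as null
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "cart_abandonment_rate": [None, 0.4, 0.2],
})

# Keep an indicator so the model can learn "feature was missing"
# as a signal distinct from an actual value of 0.0
df["cart_abandonment_missing"] = df["cart_abandonment_rate"].isna().astype(int)
df["cart_abandonment_rate"] = df["cart_abandonment_rate"].fillna(0.0)

print(df)
```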

Pattern 2: Training Data Versioning

The single most impactful use of Iceberg for ML is training data versioning using snapshots and tags. This solves the reproducibility problem that plagues every ML team.

The Problem

You trained a model in January. It performed well. In March, the model degrades. The first debugging step is: "Let me retrain on the January data to see if the model itself changed or the data changed."

Without versioning, this is impossible. The January data was overwritten by February's pipeline. You might have a backup somewhere, but it is a separate copy that may or may not match what the model actually trained on.

The Iceberg Solution

-- Step 1: Before training, tag the dataset
ALTER TABLE catalog.ml.user_features
CREATE TAG training_churn_model_v3
RETAIN 730 DAYS;

-- Step 2: Record the tag in your model registry
-- mlflow.log_param("dataset_tag", "training_churn_model_v3")
-- mlflow.log_param("dataset_table", "catalog.ml.user_features")

Now, at any point in the future:

-- Reproduce the exact training dataset
SELECT * FROM catalog.ml.user_features
VERSION AS OF 'training_churn_model_v3';

This returns the exact rows, with the exact schema, that existed when the tag was created. Not an approximation. Not a backup. The actual data.

Comparing Training Data Over Time

When a model degrades, compare the training data to the current data:

-- Distribution comparison: training vs current
WITH training AS (
  SELECT
    AVG(total_orders_30d) AS avg_orders,
    AVG(avg_order_value) AS avg_value,
    STDDEV(avg_order_value) AS std_value,
    COUNT(*) AS row_count
  FROM catalog.ml.user_features
  VERSION AS OF 'training_churn_model_v3'
),
current_data AS (
  SELECT
    AVG(total_orders_30d) AS avg_orders,
    AVG(avg_order_value) AS avg_value,
    STDDEV(avg_order_value) AS std_value,
    COUNT(*) AS row_count
  FROM catalog.ml.user_features
)
SELECT 'training' AS dataset, t.* FROM training t
UNION ALL
SELECT 'current' AS dataset, c.* FROM current_data c;

If the distributions have shifted significantly, you have found data drift — and you can quantify it without any additional tooling.
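If you want a single drift score rather than eyeballing summary statistics, a standard choice is the population stability index (PSI). Here is a self-contained sketch on synthetic data — in practice the two arrays would be the `avg_order_value` column pulled from the tagged scan and from the current scan:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population stability index between two samples of one feature."""
    # Bin edges from the expected (training-time) distribution
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range values
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    # Clip to avoid log(0) for empty bins
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(42)
train = rng.normal(50.0, 10.0, 10_000)    # feature at training time
same = rng.normal(50.0, 10.0, 10_000)     # no drift
shifted = rng.normal(65.0, 10.0, 10_000)  # mean has drifted

print(psi(train, same))     # near zero: stable
print(psi(train, shifted))  # large: drifted (a common rule of thumb flags PSI > 0.2)
```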

PyIceberg for Lightweight ML Access

Not everything needs Spark. PyIceberg provides a Python-native interface to Iceberg tables — perfect for notebook-based ML workflows and serverless environments.

from pyiceberg.catalog import load_catalog

# Connect to the catalog
catalog = load_catalog("rest", uri="https://polaris-endpoint")

# Load the feature table
table = catalog.load_table("ml.user_features")

# Read features for training (as Arrow table for zero-copy to pandas/PyTorch)
scan = table.scan(
    row_filter="feature_timestamp >= '2026-01-01'",
    selected_fields=("user_id", "total_orders_30d", "avg_order_value",
                     "days_since_last_order", "page_views_7d"),
)
arrow_table = scan.to_arrow()

# Convert to pandas for sklearn
df = arrow_table.to_pandas()

# Or build PyTorch tensors from the pandas values
import torch
features = torch.tensor(df[["total_orders_30d", "avg_order_value"]].values)

PyIceberg reads Parquet files directly from S3 — no Spark cluster needed. For training datasets under 100 GB, this is often all you need.

Pattern 3: LLM Fine-Tuning Data Pipelines

Fine-tuning large language models requires massive, curated datasets with strict quality controls. Iceberg provides the transactional foundation that makes these pipelines reliable.

The Challenge

LLM fine-tuning datasets are living artifacts. Multiple teams contribute:

  • Data curation team: Selects and preprocesses raw text
  • Labeling team: Adds quality labels and annotations
  • Safety team: Flags and removes toxic or harmful content
  • Domain experts: Validates domain-specific accuracy

Without transactional guarantees, these concurrent operations create inconsistencies — partially labeled batches, records flagged as toxic but not yet removed, duplicate entries from overlapping curation runs.

Iceberg-Based Fine-Tuning Pipeline

-- Fine-tuning dataset table
CREATE TABLE catalog.ml.finetune_dataset (
  record_id STRING,
  source STRING,
  text_content STRING,
  instruction STRING,
  response STRING,
  quality_score DOUBLE,
  toxicity_score DOUBLE,
  is_approved BOOLEAN,
  domain STRING,
  curator_id STRING,
  labeled_at TIMESTAMP,
  created_at TIMESTAMP
) USING iceberg
PARTITIONED BY (domain)
TBLPROPERTIES (
  'write.distribution-mode' = 'hash'
);

Safe Concurrent Updates with Branching

Use the Write-Audit-Publish pattern to ensure quality:

-- Safety team creates a branch for toxicity filtering
ALTER TABLE catalog.ml.finetune_dataset CREATE BRANCH safety_review;

-- Bulk update toxicity scores on the branch
-- (compute_toxicity stands in for a toxicity-scoring UDF registered in Spark)
UPDATE catalog.ml.finetune_dataset.branch_safety_review
SET toxicity_score = compute_toxicity(text_content),
    is_approved = compute_toxicity(text_content) < 0.3
WHERE toxicity_score IS NULL;

-- Review the results before publishing
SELECT
  COUNT(*) AS total,
  SUM(CASE WHEN is_approved THEN 1 ELSE 0 END) AS approved,
  SUM(CASE WHEN NOT is_approved THEN 1 ELSE 0 END) AS rejected
FROM catalog.ml.finetune_dataset VERSION AS OF 'safety_review';

-- If results look good, publish to main
CALL catalog.system.fast_forward(
'ml.finetune_dataset', 'main', 'safety_review'
);

Dataset Snapshots for Training Runs

-- Tag the dataset for a specific fine-tuning run
ALTER TABLE catalog.ml.finetune_dataset
CREATE TAG finetune_run_20260215
RETAIN 365 DAYS;

-- Export approved records for training
SELECT instruction, response
FROM catalog.ml.finetune_dataset
VERSION AS OF 'finetune_run_20260215'
WHERE is_approved = TRUE
AND quality_score >= 0.8;

If the fine-tuned model behaves unexpectedly, you can trace back to the exact training data, examine which records were included, and identify whether a data quality issue caused the problem.

Schema Evolution for Iterative Curation

As your curation process matures, you need to add metadata fields:

-- Add source quality metadata
ALTER TABLE catalog.ml.finetune_dataset ADD COLUMN
source_reliability_score DOUBLE;

-- Add annotation for RLHF (Reinforcement Learning from Human Feedback)
ALTER TABLE catalog.ml.finetune_dataset ADD COLUMN
human_preference_rank INT;

-- Add language detection
ALTER TABLE catalog.ml.finetune_dataset ADD COLUMN
detected_language STRING;

Each addition is metadata-only. Old records get null for the new fields. No data rewrite, no pipeline disruption.

Pattern 4: Agentic AI and Dynamic Query Patterns

The emerging pattern of agentic AI — autonomous agents that query, reason about, and act on data — creates a fundamentally different data access pattern than traditional BI or batch analytics.

How Agents Query Differently

Traditional BI has predictable query patterns. A dashboard queries the same tables, with the same filters, on a known schedule. You can pre-optimize the storage layout for these patterns.

AI agents do not work this way. An agent tasked with "analyze sales trends and recommend inventory adjustments" might:

  1. Query orders by region and date_range
  2. Join with products by category
  3. Aggregate inventory by warehouse_location
  4. Scan supplier_performance by lead_time

The exact tables, filters, and join patterns depend on the agent's reasoning path — which varies per invocation. No human DBA can pre-optimize for every possible path.
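Because the SQL is assembled at runtime from whatever the agent decided to look at, agent frameworks typically carry a small query-builder helper rather than hand-written queries. A toy sketch of what such a helper might look like — every name here is hypothetical, and note that the agent only ever references logical columns, never partition schemes:

```python
def build_query(table, columns, filters=None, group_by=None, limit=None):
    """Assemble a SELECT from parts an agent chose while reasoning.

    columns/filters/group_by are plain strings derived from schema
    introspection; no knowledge of physical partitioning is needed,
    since Iceberg prunes files from the logical predicates.
    """
    parts = [f"SELECT {', '.join(columns)}", f"FROM {table}"]
    if filters:
        parts.append("WHERE " + " AND ".join(filters))
    if group_by:
        parts.append("GROUP BY " + ", ".join(group_by))
    if limit:
        parts.append(f"LIMIT {limit}")
    return "\n".join(parts)

# e.g. step 1 of the plan above: orders by region and date range
sql = build_query(
    "catalog.analytics.orders",
    ["region", "SUM(total_amount) AS revenue"],
    filters=["order_date >= '2026-01-01'"],
    group_by=["region"],
)
print(sql)
```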

Why Iceberg Handles This Well

Self-describing metadata. Iceberg's manifest files contain column-level min/max statistics, null counts, and partition information for every data file. An agent's query planner can prune irrelevant files without scanning them — regardless of the access pattern.

Hidden partitioning. The agent does not need to know how the data is physically partitioned. It writes SQL against logical columns, and Iceberg's partition pruning handles the rest. This is critical because agents generate SQL programmatically — they should not need to understand partition schemes.

Schema discovery. An AI agent can introspect an Iceberg catalog to discover available tables, their schemas, and their relationships:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("rest", uri="https://polaris-endpoint")

# Agent discovers available tables
tables = catalog.list_tables("analytics")

# Agent inspects a table's schema
table = catalog.load_table("analytics.orders")
schema = table.schema()

# Agent generates and executes a query based on schema understanding
for field in schema.fields:
    print(f"{field.name}: {field.field_type} (id: {field.field_id})")

Concurrent reads and writes. Agents may query tables while ETL pipelines are actively writing to them. Iceberg's snapshot isolation guarantees that the agent always reads a consistent view — no dirty reads, no partial writes, no phantom rows.

Practical Pattern: Agent-Accessible Data Catalog

Structure your Iceberg catalog so that AI agents can navigate it effectively:

catalog/
  analytics/          -- business metrics and KPIs
    orders
    revenue_daily
    customer_segments
  ml/                 -- ML-specific tables
    user_features
    model_predictions
    finetune_dataset
  raw/                -- raw ingested data
    clickstream
    server_logs
    api_events

Each namespace groups related tables that an agent might need together. The agent queries the catalog API to discover what is available, reads the schema to understand the structure, and generates SQL based on its reasoning.

The Unified Architecture: One Platform for Analytics and AI

The most powerful aspect of using Iceberg for AI workloads is that it eliminates the data silo between analytics and ML. Your analytics tables and your ML feature tables live in the same catalog, on the same storage, governed by the same policies.

              ┌─────────────────────────┐
              │ Apache Polaris Catalog  │
              │ (Single Source of Truth)│
              └────────────┬────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
┌───────▼───────┐  ┌───────▼───────┐  ┌───────▼───────┐
│   Analytics   │  │  ML Features  │  │   AI Agents   │
│ (Spark/Trino) │  │    (Spark/    │  │  (PyIceberg/  │
│               │  │   PyIceberg)  │  │   REST API)   │
└───────┬───────┘  └───────┬───────┘  └───────┬───────┘
        │                  │                  │
        └──────────────────┼──────────────────┘
                           │
              ┌────────────▼────────────┐
              │  Iceberg Tables on S3   │
              │ (Single Storage Layer)  │
              └─────────────────────────┘

No data copying between systems. No sync jobs to keep feature stores up to date. No separate versioning infrastructure. Iceberg provides the data contract that all consumers — analytical, ML, and AI — share.

Getting Started: A Minimal AI-Ready Iceberg Setup

If you are just starting, here is the minimum viable setup:

1. Create a Feature Table

CREATE TABLE catalog.ml.features (
  entity_id STRING,
  feature_timestamp TIMESTAMP,
  feature_1 DOUBLE,
  feature_2 DOUBLE,
  feature_3 STRING
) USING iceberg
PARTITIONED BY (day(feature_timestamp));

2. Populate Features Daily

INSERT INTO catalog.ml.features
SELECT entity_id, current_timestamp(), f1, f2, f3
FROM your_feature_engineering_query;

3. Tag Before Training

ALTER TABLE catalog.ml.features
CREATE TAG my_model_v1_data RETAIN 365 DAYS;

4. Train with PyIceberg

catalog = load_catalog("rest", uri="https://polaris-endpoint")
table = catalog.load_table("ml.features")

# Resolve the tag to the snapshot it pins
tag_snapshot_id = table.metadata.refs["my_model_v1_data"].snapshot_id

df = table.scan(snapshot_id=tag_snapshot_id).to_arrow().to_pandas()
model.fit(df[feature_cols], df[target_col])

5. Debug with Time Travel

SELECT * FROM catalog.ml.features
VERSION AS OF 'my_model_v1_data';

That is it. Five steps to AI-ready data infrastructure — built on the same Iceberg tables your analytics team already uses.


Building AI workloads on Iceberg? Cazpian provides a fully managed Spark and Iceberg platform with Apache Polaris catalog, zero cold starts, and policy-based table maintenance — giving your ML team the data infrastructure they need without the operational overhead. Learn more.