Iceberg AI Readiness
Iceberg AI readiness refers to the set of architectural properties in Apache Iceberg that make it particularly well-suited as the data foundation for AI and machine learning workloads. These properties go beyond standard analytics requirements — they address specific challenges that AI and ML teams face when working with large-scale datasets.
Key Properties That Make Iceberg AI-Ready
1. Reproducible Training Datasets via Snapshots
ML models must be reproducible: the same training data and code should produce the same model. Iceberg’s immutable snapshot system provides the data side of this guarantee naturally.
When training a model:
- Record the snapshot ID used for training.
- Store this snapshot ID alongside the model artifact.
- To reproduce: load the table at the exact snapshot → identical training data.
```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("my_catalog", **{...})
table = catalog.load_table("ml.training_features")

# Get the current snapshot ID before training
training_snapshot_id = table.current_snapshot().snapshot_id
print(f"Training on snapshot: {training_snapshot_id}")

# Load the training data pinned to that snapshot
df = table.scan(snapshot_id=training_snapshot_id).to_arrow().to_pandas()

# Train model...
# Record training_snapshot_id in MLflow/metadata

# Later: reproduce the exact training dataset
reproduce_df = table.scan(snapshot_id=training_snapshot_id).to_arrow().to_pandas()
```
Even if the production table has been updated hundreds of times since training, the original training snapshot remains accessible until explicitly expired.
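The record-and-replay convention can be sketched as a minimal in-memory registry. The `ModelRecord` structure and `registry` dict below are illustrative stand-ins, not a real MLflow API; in practice the snapshot ID would be stored as a run tag or model-registry attribute:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelRecord:
    """Hypothetical metadata stored alongside a trained model artifact."""
    model_name: str
    model_version: int
    table_name: str
    snapshot_id: int  # Iceberg snapshot the model was trained on

# Illustrative in-memory registry; in practice this lives in MLflow tags
# or your model registry of choice
registry = {}

def register(record: ModelRecord) -> None:
    registry[(record.model_name, record.model_version)] = record

def snapshot_for(model_name: str, model_version: int) -> int:
    """Look up the exact snapshot to rescan for reproduction."""
    return registry[(model_name, model_version)].snapshot_id

register(ModelRecord("churn", 3, "ml.training_features", 8271263289848261351))
print(snapshot_for("churn", 3))  # 8271263289848261351
```

Any pipeline that can resolve a model version to its snapshot ID can rebuild the exact training frame with a single pinned scan.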
2. Schema Evolution Without Pipeline Breakage
ML feature pipelines break when source tables change schema. Iceberg’s schema evolution is backward-compatible:
- New columns added after the feature pipeline was written return NULL for historical rows — the pipeline continues to work.
- Dropped columns that the pipeline doesn’t use don’t cause failures.
- Renamed columns can be resolved via schema metadata inspection.
This is crucial for long-running ML systems that must survive table schema changes without emergency pipeline updates.
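Iceberg achieves this by resolving columns through stable field IDs rather than names: a rename only changes the name mapped to an ID, and reading a file written before a column existed fills that column with NULL. A simplified pure-Python sketch of that resolution rule (the schemas and rows here are illustrative, not Iceberg internals):

```python
# Each schema maps field ID -> current column name; data files record
# values keyed by the field IDs that existed when the file was written.
schema_v1 = {1: "user_id", 2: "signup_ts"}
schema_v2 = {1: "user_id", 2: "signup_time", 3: "country"}  # rename + add

old_file = {1: 42, 2: "2024-01-01"}           # written under schema v1
new_file = {1: 43, 2: "2024-06-01", 3: "DE"}  # written under schema v2

def read_row(file_row, read_schema):
    """Resolve by field ID: renamed columns still match, missing ones are NULL."""
    return {name: file_row.get(field_id) for field_id, name in read_schema.items()}

print(read_row(old_file, schema_v2))
# {'user_id': 42, 'signup_time': '2024-01-01', 'country': None}
```

Because the pipeline's read schema is applied at scan time, old data files never need rewriting when the schema evolves.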
3. Python-Native Access via PyIceberg
The ML/data science ecosystem is Python-first. PyIceberg provides direct Python access to Iceberg tables without requiring Spark or JVM:
```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# No Spark, no JVM, no cluster needed
catalog = load_catalog("my_catalog", type="rest", uri="...")
features = catalog.load_table("ml.user_features").scan(
    selected_fields=("user_id", "feature_1", "feature_2", "label")
).to_arrow()

# Direct to a PyTorch DataLoader
import torch
from torch.utils.data import Dataset

class IcebergDataset(Dataset):
    def __init__(self, arrow_table):
        self.data = arrow_table

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # One-row slice converted to a tensor; fine for prototyping,
        # batched conversion is faster in production training loops
        return torch.tensor(self.data.slice(idx, 1).to_pandas().values[0])
```
4. Feature Store Foundation
Iceberg is increasingly used as the storage layer for feature stores:
- Offline feature store: Historical features for model training, stored as Iceberg tables.
- Feature versioning: Each feature computation run creates a new snapshot.
- Point-in-time correct queries: Time travel ensures training uses only features that were available at prediction time (no future leakage).
```python
# Point-in-time correct feature retrieval (prevents future leakage)
user_features = table.scan(
    snapshot_id=snapshot_at_training_time,
    row_filter="event_date <= '2026-01-01'"
).to_arrow()
```
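The same no-leakage rule applies when joining features to labels: each training example may only see the latest feature value computed at or before its own timestamp. A minimal as-of lookup sketch (the feature history and timestamps are illustrative):

```python
from bisect import bisect_right

# (timestamp, value) feature history for one entity, sorted by timestamp
feature_history = [(1, 0.2), (5, 0.7), (9, 0.9)]

def as_of(history, ts):
    """Latest feature value at or before ts, or None if none existed yet."""
    times = [t for t, _ in history]
    i = bisect_right(times, ts)
    return history[i - 1][1] if i else None

print(as_of(feature_history, 6))  # 0.7  (value from ts=5, never the future ts=9)
print(as_of(feature_history, 0))  # None (no feature existed yet)
```

Feature stores built on Iceberg implement exactly this semantics at scale, using snapshot pinning plus timestamp filters instead of per-row lookups.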
5. Governed Access via REST Catalog
AI pipelines need controlled data access:
- Training pipelines should access only their authorized feature sets.
- Model inference should access only the tables needed for feature retrieval.
- Credential vending from a REST Catalog implementation such as Apache Polaris ensures pipelines never hold more access than they need.
6. Interoperability with the ML Ecosystem
Iceberg integrates with the ML ecosystem via:
| Integration | Use Case |
|---|---|
| PyArrow | High-performance columnar data for ML |
| Pandas | Data exploration and feature engineering |
| DuckDB | SQL feature queries without Spark |
| Ray | Distributed ML training on Iceberg data |
| Hugging Face Datasets | Via Arrow table bridge |
| MLflow | Log snapshot IDs with model artifacts |
| Feast | Feature store on Iceberg offline store |
The Agentic AI Data Stack
For AI agents that need to query, reason over, and act on data:
- Apache Iceberg: Open, reproducible, governed data storage.
- PyIceberg / MCP Server: Python-native or agent-native data access.
- Apache Polaris: Catalog for discovery and access control.
- Dremio AI Semantic Layer: Business context for agent understanding.
- Dremio AI Agent: Autonomous analytics execution.
This stack provides AI agents with governed access to structured, versioned, contextualized data — the foundation for trustworthy agentic analytics.