Iceberg AI Readiness
Iceberg AI readiness refers to the set of architectural properties in Apache Iceberg that make it particularly well-suited as the data foundation for AI and machine learning workloads. These properties go beyond standard analytics requirements — they address specific challenges that AI and ML teams face when working with large-scale datasets.
Key Properties That Make Iceberg AI-Ready
1. Reproducible Training Datasets via Snapshots
ML models must be reproducible: the same training data and code should produce the same model. Iceberg’s immutable snapshot system provides the data side of this guarantee naturally.
When training a model:
- Record the snapshot ID used for training.
- Store this snapshot ID alongside the model artifact.
- To reproduce: load the table at the exact snapshot → identical training data.
```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("my_catalog", **{...})
table = catalog.load_table("ml.training_features")

# Get the current snapshot ID before training
training_snapshot_id = table.current_snapshot().snapshot_id
print(f"Training on snapshot: {training_snapshot_id}")

# Load the training data pinned to that snapshot
df = table.scan(snapshot_id=training_snapshot_id).to_arrow().to_pandas()

# Train model...
# Record training_snapshot_id in MLflow/metadata

# Later: reproduce the exact training dataset
reproduce_df = table.scan(snapshot_id=training_snapshot_id).to_arrow().to_pandas()
```
Even if the production table has been updated hundreds of times since training, the original training snapshot remains accessible until explicitly expired.
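The record-and-replay convention can be sketched as a minimal in-memory registry. The `ModelRecord` structure and `registry` dict below are illustrative stand-ins, not a real MLflow API; in practice the snapshot ID would be stored as a run tag or model-registry attribute:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelRecord:
    """Hypothetical metadata stored alongside a trained model artifact."""
    model_name: str
    model_version: int
    table_name: str
    snapshot_id: int  # Iceberg snapshot the model was trained on

# Illustrative in-memory registry; in practice this lives in MLflow tags
# or your model registry of choice
registry = {}

def register(record: ModelRecord) -> None:
    registry[(record.model_name, record.model_version)] = record

def snapshot_for(model_name: str, model_version: int) -> int:
    """Look up the exact snapshot to rescan for reproduction."""
    return registry[(model_name, model_version)].snapshot_id

register(ModelRecord("churn", 3, "ml.training_features", 8271263289848261351))
print(snapshot_for("churn", 3))  # 8271263289848261351
```

Any pipeline that can resolve a model version to its snapshot ID can rebuild the exact training frame with a single pinned scan.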
2. Schema Evolution Without Pipeline Breakage
ML feature pipelines break when source tables change schema. Iceberg’s schema evolution is backward-compatible:
- New columns added after the feature pipeline was written return NULL for historical rows — the pipeline continues to work.
- Dropped columns that the pipeline doesn’t use don’t cause failures.
- Renamed columns can be resolved via schema metadata inspection.
This is crucial for long-running ML systems that must survive table schema changes without emergency pipeline updates.
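Iceberg achieves this by resolving columns through stable field IDs rather than names: a rename only changes the name mapped to an ID, and reading a file written before a column existed fills that column with NULL. A simplified pure-Python sketch of that resolution rule (the schemas and rows here are illustrative, not Iceberg internals):

```python
# Each schema maps field ID -> current column name; data files record
# values keyed by the field IDs that existed when the file was written.
schema_v1 = {1: "user_id", 2: "signup_ts"}
schema_v2 = {1: "user_id", 2: "signup_time", 3: "country"}  # rename + add

old_file = {1: 42, 2: "2024-01-01"}           # written under schema v1
new_file = {1: 43, 2: "2024-06-01", 3: "DE"}  # written under schema v2

def read_row(file_row, read_schema):
    """Resolve by field ID: renamed columns still match, missing ones are NULL."""
    return {name: file_row.get(field_id) for field_id, name in read_schema.items()}

print(read_row(old_file, schema_v2))
# {'user_id': 42, 'signup_time': '2024-01-01', 'country': None}
```

Because the pipeline's read schema is applied at scan time, old data files never need rewriting when the schema evolves.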
3. Python-Native Access via PyIceberg
The ML/data science ecosystem is Python-first. PyIceberg provides direct Python access to Iceberg tables without requiring Spark or JVM:
```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# No Spark, no JVM, no cluster needed
catalog = load_catalog("my_catalog", type="rest", uri="...")
features = catalog.load_table("ml.user_features").scan(
    selected_fields=("user_id", "feature_1", "feature_2", "label")
).to_arrow()

# Direct to a PyTorch DataLoader
import torch
from torch.utils.data import Dataset

class IcebergDataset(Dataset):
    def __init__(self, arrow_table):
        self.data = arrow_table

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # One-row slice converted to a tensor; fine for prototyping,
        # batched conversion is faster in production training loops
        return torch.tensor(self.data.slice(idx, 1).to_pandas().values[0])
```
4. Feature Store Foundation
Iceberg is increasingly used as the storage layer for feature stores:
- Offline feature store: Historical features for model training, stored as Iceberg tables.
- Feature versioning: Each feature computation run creates a new snapshot.
- Point-in-time correct queries: Time travel ensures training uses only features that were available at prediction time (no future leakage).
```python
# Point-in-time correct feature retrieval (prevents future leakage)
user_features = table.scan(
    snapshot_id=snapshot_at_training_time,
    row_filter="event_date <= '2026-01-01'"
).to_arrow()
```
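The same no-leakage rule applies when joining features to labels: each training example may only see the latest feature value computed at or before its own timestamp. A minimal as-of lookup sketch (the feature history and timestamps are illustrative):

```python
from bisect import bisect_right

# (timestamp, value) feature history for one entity, sorted by timestamp
feature_history = [(1, 0.2), (5, 0.7), (9, 0.9)]

def as_of(history, ts):
    """Latest feature value at or before ts, or None if none existed yet."""
    times = [t for t, _ in history]
    i = bisect_right(times, ts)
    return history[i - 1][1] if i else None

print(as_of(feature_history, 6))  # 0.7  (value from ts=5, never the future ts=9)
print(as_of(feature_history, 0))  # None (no feature existed yet)
```

Feature stores built on Iceberg implement exactly this semantics at scale, using snapshot pinning plus timestamp filters instead of per-row lookups.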
5. Governed Access via REST Catalog
AI pipelines need controlled data access:
- Training pipelines should access only their authorized feature sets.
- Model inference should access only the tables needed for feature retrieval.
- Credential vending from a REST Catalog implementation such as Apache Polaris ensures pipelines never hold more access than they need.
6. Interoperability with the ML Ecosystem
Iceberg integrates with the ML ecosystem via:
| Integration | Use Case |
|---|---|
| PyArrow | High-performance columnar data for ML |
| Pandas | Data exploration and feature engineering |
| DuckDB | SQL feature queries without Spark |
| Ray | Distributed ML training on Iceberg data |
| Hugging Face Datasets | Via Arrow table bridge |
| MLflow | Log snapshot IDs with model artifacts |
| Feast | Feature store on Iceberg offline store |
The Agentic AI Data Stack
For AI agents that need to query, reason over, and act on data:
- Apache Iceberg: Open, reproducible, governed data storage.
- PyIceberg / MCP Server: Python-native or agent-native data access.
- Apache Polaris: Catalog for discovery and access control.
- Dremio AI Semantic Layer: Business context for agent understanding.
- Dremio AI Agent: Autonomous analytics execution.
This stack provides AI agents with governed access to structured, versioned, contextualized data — the foundation for trustworthy agentic analytics.