Last updated: May 14, 2026

Iceberg Feature Store

Apache Iceberg is used as the offline storage layer in ML feature stores, providing point-in-time correct feature retrieval via time travel, versioned feature datasets via snapshots, and Python-native access via PyIceberg for training data preparation.


Iceberg as a Feature Store

A feature store is a centralized repository for machine learning features — precomputed, versioned, and governed datasets that serve both model training (offline access) and model inference (online access). Apache Iceberg is increasingly used as the offline feature store layer in production ML platforms, providing the versioning, governance, and Python integration that ML workflows require.

What Makes Iceberg Ideal for Feature Storage

Point-in-Time Correct Feature Retrieval

The most critical requirement for ML feature stores is point-in-time correctness — training data must only use features that were known at the time of the prediction event (no future information leakage).

Iceberg’s time travel makes this natural:

from pyiceberg.catalog import load_catalog
from datetime import datetime

catalog = load_catalog("my_catalog", **{...})  # catalog connection properties elided
feature_table = catalog.load_table("features.user_features")

# Point-in-time retrieval: only features known as of the event time
event_time = datetime(2026, 3, 15, 12, 0, 0)
snapshot_at_event = feature_table.snapshot_as_of_timestamp(
    int(event_time.timestamp() * 1000)
)

features_at_event = feature_table.scan(
    snapshot_id=snapshot_at_event.snapshot_id
).to_arrow().to_pandas()

Without time travel, teams manually track feature computation timestamps and implement complex join logic to prevent leakage. Iceberg makes this a simple time travel query.
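That manual join logic is essentially an "as-of" lookup: for each prediction event, take the newest feature row computed at or before the event time. A minimal standard-library sketch of the idea; the data and the features_as_of helper are hypothetical, not part of any feature-store API:

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for one user, sorted by computation time.
feature_history = [
    (datetime(2026, 3, 1), {"lifetime_orders": 10}),
    (datetime(2026, 3, 10), {"lifetime_orders": 12}),
    (datetime(2026, 3, 20), {"lifetime_orders": 15}),
]

def features_as_of(history, event_time):
    """Return the newest feature row computed at or before event_time."""
    times = [computed_at for computed_at, _ in history]
    idx = bisect_right(times, event_time)
    if idx == 0:
        return None  # no features known yet; using any row would leak the future
    return history[idx - 1][1]

# A March 15 event must see the March 10 features, never the March 20 row.
print(features_as_of(feature_history, datetime(2026, 3, 15)))  # -> {'lifetime_orders': 12}
```

Teams without time travel maintain this bookkeeping themselves for every feature table; Iceberg's snapshot timeline does it at the storage layer.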

Reproducible Training Datasets

Record the snapshot ID used for each training run and store it alongside the model artifact:

import mlflow

# Record training snapshot for reproducibility
with mlflow.start_run():
    training_snapshot = feature_table.current_snapshot().snapshot_id
    mlflow.log_param("training_snapshot_id", training_snapshot)
    mlflow.log_param("feature_table", "features.user_features")

    # Train model
    df = feature_table.scan(snapshot_id=training_snapshot).to_arrow().to_pandas()
    model = train_model(df)
    mlflow.sklearn.log_model(model, "model")

To reproduce training:

# Exactly recreate training data using recorded snapshot
model_run = mlflow.get_run(run_id)
snapshot_id = int(model_run.data.params["training_snapshot_id"])
df = feature_table.scan(snapshot_id=snapshot_id).to_arrow().to_pandas()
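A cheap extra safeguard is to log a content fingerprint of the training data alongside the snapshot ID, so a later run can prove it rebuilt identical data. A standard-library sketch; dataset_fingerprint is a hypothetical helper, not an MLflow or PyIceberg API:

```python
import hashlib

def dataset_fingerprint(rows):
    """Order-insensitive SHA-256 fingerprint of a list of row dicts."""
    digest = hashlib.sha256()
    for row in sorted(repr(sorted(r.items())) for r in rows):
        digest.update(row.encode())
    return digest.hexdigest()

rows_a = [{"user_id": 1, "orders": 12}, {"user_id": 2, "orders": 3}]
rows_b = [{"user_id": 2, "orders": 3}, {"user_id": 1, "orders": 12}]  # same rows, new order

# Same data in a different order yields the same fingerprint, so the value
# can be logged (e.g. via mlflow.log_param) and compared across re-runs.
assert dataset_fingerprint(rows_a) == dataset_fingerprint(rows_b)
```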

Feature Versioning via CDC

Feature tables are kept current via CDC pipelines:

Operational DB → Flink CDC → Iceberg Feature Table

Each feature update creates a new snapshot, so the feature store can serve both the latest feature values (for online sync) and any historical version (for point-in-time training).
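The snapshot-per-commit behavior can be pictured with a toy in-memory model: each CDC batch produces a new immutable version, online sync reads the newest one, and training can pin any older one. Illustrative only, not the PyIceberg API:

```python
history = []  # list of (snapshot_id, table_state) tuples, oldest first

def commit(state):
    """Append a new immutable snapshot and return its ID (illustrative)."""
    snapshot_id = len(history) + 1
    history.append((snapshot_id, dict(state)))
    return snapshot_id

s1 = commit({"user_1": {"orders": 10}})
s2 = commit({"user_1": {"orders": 11}})  # CDC update -> a brand-new snapshot

def read(snapshot_id=None):
    """Latest state for serving; a pinned snapshot for training."""
    if snapshot_id is None:
        return history[-1][1]
    return dict(history)[snapshot_id]

assert read() == {"user_1": {"orders": 11}}    # online sync sees the newest state
assert read(s1) == {"user_1": {"orders": 10}}  # training can pin an older version
```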

Iceberg Feature Store Architecture

Feature Engineering (Spark/Flink)
          │
          ▼
Iceberg Feature Tables (offline store)
  ├── user_features (user engagement, demographics)
  ├── product_features (product attributes, popularity)
  └── session_features (session behavior, recency)
          │
          ├── Training pipeline (PyIceberg + point-in-time query → Pandas/PyArrow → Model)
          └── Online store sync (feature freshness → Redis/DynamoDB → serving)
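The online-sync leg reduces to projecting the latest offline rows into a key-value layout for low-latency lookups. A hypothetical sketch in which a plain dict stands in for Redis/DynamoDB:

```python
# Hypothetical latest offline feature rows for two users.
offline_rows = [
    {"user_id": 1, "lifetime_orders": 12, "segment": "power"},
    {"user_id": 2, "lifetime_orders": 3, "segment": "casual"},
]

online_store = {}  # stands in for Redis/DynamoDB

def sync_to_online(rows, key="user_id"):
    """Upsert each entity's feature values under its entity key."""
    for row in rows:
        online_store[row[key]] = {k: v for k, v in row.items() if k != key}

sync_to_online(offline_rows)
print(online_store[1])  # -> {'lifetime_orders': 12, 'segment': 'power'}
```

At inference time the model server does a single key lookup per entity instead of scanning the offline table.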

Integration with Feature Store Frameworks

Feast + Iceberg

Feast (the most popular open-source feature store) can use Iceberg tables as its offline store by putting an Iceberg-aware engine underneath it, for example its Spark offline store with an Iceberg catalog configured. A sketch of a feature_store.yaml under that assumption (catalog name, URI, and connection details are placeholders):

project: feature_repo
registry: data/registry.db
provider: local
offline_store:
  type: spark
  spark_conf:
    spark.sql.catalog.my_catalog: org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.my_catalog.type: rest
    spark.sql.catalog.my_catalog.uri: https://my-catalog.example.com
online_store:
  type: redis
  connection_string: localhost:6379

With Feast + Iceberg, feature definitions live in code while the feature data lives in open, versioned tables: point-in-time training joins run against Iceberg snapshots, and materialization to the online store reads from the same governed source.

Tecton + Iceberg

Tecton (an enterprise feature platform) supports Iceberg as an offline store format, keeping materialized feature data in open tables that external engines and governance tools can read directly.

Feature Table Design Patterns

SCD Type 2 Feature History

Store all historical versions of feature values with effective dates:

CREATE TABLE features.user_features (
    user_id           BIGINT NOT NULL,
    feature_date      DATE NOT NULL,
    days_since_signup INT,
    lifetime_orders   INT,
    lifetime_revenue  DOUBLE,
    segment           STRING,
    computed_at       TIMESTAMP
) USING iceberg
PARTITIONED BY (months(feature_date));

-- In Spark SQL the write sort order is set separately:
ALTER TABLE features.user_features WRITE ORDERED BY user_id, feature_date;
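The payoff of keeping full history is that point-in-time retrieval becomes a filter plus a latest-per-key step. A standard-library sketch of that lookup over hypothetical rows shaped like the history table above:

```python
from datetime import date

# Hypothetical rows from features.user_features (one row per user per day).
rows = [
    {"user_id": 1, "feature_date": date(2026, 3, 1), "lifetime_orders": 10},
    {"user_id": 1, "feature_date": date(2026, 3, 8), "lifetime_orders": 12},
    {"user_id": 2, "feature_date": date(2026, 3, 8), "lifetime_orders": 3},
]

def features_as_of_date(rows, as_of):
    """Latest row per user with feature_date <= as_of."""
    latest = {}
    for row in sorted(rows, key=lambda r: r["feature_date"]):
        if row["feature_date"] <= as_of:
            latest[row["user_id"]] = row  # later dates overwrite earlier ones
    return latest

result = features_as_of_date(rows, date(2026, 3, 5))
assert result[1]["lifetime_orders"] == 10  # the March 8 row is invisible on March 5
assert 2 not in result                     # user 2 had no features yet
```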

Latest Features Table (SCD Type 1)

Store only the most recent feature values, updated via upsert:

-- Daily feature refresh via MERGE INTO
MERGE INTO features.user_features_latest AS target
USING daily_computed_features AS source
ON target.user_id = source.user_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
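The MERGE above is plain upsert semantics: matched keys are updated, unmatched keys are inserted, untouched keys survive. A toy in-memory equivalent:

```python
# target mirrors user_features_latest; source mirrors daily_computed_features.
target = {1: {"lifetime_orders": 10}, 2: {"lifetime_orders": 3}}
source = {1: {"lifetime_orders": 12}, 3: {"lifetime_orders": 1}}

def merge_upsert(target, source):
    """WHEN MATCHED THEN UPDATE; WHEN NOT MATCHED THEN INSERT."""
    for user_id, features in source.items():
        target[user_id] = features
    return target

merge_upsert(target, source)
assert target[1] == {"lifetime_orders": 12}  # updated
assert target[3] == {"lifetime_orders": 1}   # inserted
assert target[2] == {"lifetime_orders": 3}   # untouched
```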

Dremio and the Feature Store

Dremio’s Agentic Lakehouse complements the Iceberg feature store.

The combination of Iceberg + Dremio turns the offline feature store into an AI-accessible, semantically-rich data resource — not just a raw storage layer for ML pipelines.

📚 Go Deeper on Apache Iceberg

Alex Merced has authored three hands-on books covering Apache Iceberg, the Agentic Lakehouse, and modern data architecture. Pick up a copy to master the full ecosystem.
