Last updated: May 14, 2026

Iceberg Feature Store

Apache Iceberg is used as the offline storage layer in ML feature stores, providing point-in-time correct feature retrieval via time travel, versioned feature datasets via snapshots, and Python-native access via PyIceberg for training data preparation.


Iceberg as a Feature Store

A feature store is a centralized repository for machine learning features — precomputed, versioned, and governed datasets that serve both model training (offline access) and model inference (online access). Apache Iceberg is increasingly used as the offline feature store layer in production ML platforms, providing the versioning, governance, and Python integration that ML workflows require.

What Makes Iceberg Ideal for Feature Storage

Point-in-Time Correct Feature Retrieval

The most critical requirement for ML feature stores is point-in-time correctness — training data must only use features that were known at the time of the prediction event (no future information leakage).

Iceberg’s time travel makes this natural:

from pyiceberg.catalog import load_catalog
from datetime import datetime

catalog = load_catalog("my_catalog", **{...})  # catalog connection properties elided
feature_table = catalog.load_table("features.user_features")

# Point-in-time retrieval: only features known as of the event time
event_time = datetime(2026, 3, 15, 12, 0, 0)
snapshot_at_event = feature_table.snapshot_as_of_timestamp(
    int(event_time.timestamp() * 1000)
)

features_at_event = feature_table.scan(
    snapshot_id=snapshot_at_event.snapshot_id
).to_arrow().to_pandas()

Without time travel, teams manually track feature computation timestamps and implement complex join logic to prevent leakage. Iceberg makes this a simple time travel query.
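That manual join logic is essentially an "as-of" lookup: for each prediction event, take the newest feature row computed at or before the event time. A minimal standard-library sketch of the idea; the data and the features_as_of helper are hypothetical, not part of any feature-store API:

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for one user, sorted by computation time.
feature_history = [
    (datetime(2026, 3, 1), {"lifetime_orders": 10}),
    (datetime(2026, 3, 10), {"lifetime_orders": 12}),
    (datetime(2026, 3, 20), {"lifetime_orders": 15}),
]

def features_as_of(history, event_time):
    """Return the newest feature row computed at or before event_time."""
    times = [computed_at for computed_at, _ in history]
    idx = bisect_right(times, event_time)
    if idx == 0:
        return None  # no features known yet; using any row would leak the future
    return history[idx - 1][1]

# A March 15 event must see the March 10 features, never the March 20 row.
print(features_as_of(feature_history, datetime(2026, 3, 15)))  # -> {'lifetime_orders': 12}
```

Teams without time travel maintain this bookkeeping themselves for every feature table; Iceberg's snapshot timeline does it at the storage layer.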

Reproducible Training Datasets

Record the snapshot ID used for each training run and store it alongside the model artifact:

import mlflow

# Record training snapshot for reproducibility
with mlflow.start_run():
    training_snapshot = feature_table.current_snapshot().snapshot_id
    mlflow.log_param("training_snapshot_id", training_snapshot)
    mlflow.log_param("feature_table", "features.user_features")

    # Train model
    df = feature_table.scan(snapshot_id=training_snapshot).to_arrow().to_pandas()
    model = train_model(df)
    mlflow.sklearn.log_model(model, "model")

To reproduce training:

# Exactly recreate training data using recorded snapshot
model_run = mlflow.get_run(run_id)
snapshot_id = int(model_run.data.params["training_snapshot_id"])
df = feature_table.scan(snapshot_id=snapshot_id).to_arrow().to_pandas()
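A cheap extra safeguard is to log a content fingerprint of the training data alongside the snapshot ID, so a later run can prove it rebuilt identical data. A standard-library sketch; dataset_fingerprint is a hypothetical helper, not an MLflow or PyIceberg API:

```python
import hashlib

def dataset_fingerprint(rows):
    """Order-insensitive SHA-256 fingerprint of a list of row dicts."""
    digest = hashlib.sha256()
    for row in sorted(repr(sorted(r.items())) for r in rows):
        digest.update(row.encode())
    return digest.hexdigest()

rows_a = [{"user_id": 1, "orders": 12}, {"user_id": 2, "orders": 3}]
rows_b = [{"user_id": 2, "orders": 3}, {"user_id": 1, "orders": 12}]  # same rows, new order

# Same data in a different order yields the same fingerprint, so the value
# can be logged (e.g. via mlflow.log_param) and compared across re-runs.
assert dataset_fingerprint(rows_a) == dataset_fingerprint(rows_b)
```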

Feature Versioning via CDC

Feature tables are kept current via CDC pipelines:

Operational DB → Flink CDC → Iceberg Feature Table

Each feature update creates a new snapshot, so the feature store can serve both the latest feature values (for online sync) and any historical version (for point-in-time training).
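The snapshot-per-commit behavior can be pictured with a toy in-memory model: each CDC batch produces a new immutable version, online sync reads the newest one, and training can pin any older one. Illustrative only, not the PyIceberg API:

```python
history = []  # list of (snapshot_id, table_state) tuples, oldest first

def commit(state):
    """Append a new immutable snapshot and return its ID (illustrative)."""
    snapshot_id = len(history) + 1
    history.append((snapshot_id, dict(state)))
    return snapshot_id

s1 = commit({"user_1": {"orders": 10}})
s2 = commit({"user_1": {"orders": 11}})  # CDC update -> a brand-new snapshot

def read(snapshot_id=None):
    """Latest state for serving; a pinned snapshot for training."""
    if snapshot_id is None:
        return history[-1][1]
    return dict(history)[snapshot_id]

assert read() == {"user_1": {"orders": 11}}    # online sync sees the newest state
assert read(s1) == {"user_1": {"orders": 10}}  # training can pin an older version
```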

Iceberg Feature Store Architecture

Feature Engineering (Spark/Flink)
          │
          ▼
Iceberg Feature Tables (offline store)
  ├── user_features (user engagement, demographics)
  ├── product_features (product attributes, popularity)
  └── session_features (session behavior, recency)
          │
          ├── Training pipeline (PyIceberg + point-in-time query → Pandas/PyArrow → Model)
          └── Online store sync (feature freshness → Redis/DynamoDB → serving)
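The online-sync leg reduces to projecting the latest offline rows into a key-value layout for low-latency lookups. A hypothetical sketch in which a plain dict stands in for Redis/DynamoDB:

```python
# Hypothetical latest offline feature rows for two users.
offline_rows = [
    {"user_id": 1, "lifetime_orders": 12, "segment": "power"},
    {"user_id": 2, "lifetime_orders": 3, "segment": "casual"},
]

online_store = {}  # stands in for Redis/DynamoDB

def sync_to_online(rows, key="user_id"):
    """Upsert each entity's feature values under its entity key."""
    for row in rows:
        online_store[row[key]] = {k: v for k, v in row.items() if k != key}

sync_to_online(offline_rows)
print(online_store[1])  # -> {'lifetime_orders': 12, 'segment': 'power'}
```

At inference time the model server does a single key lookup per entity instead of scanning the offline table.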

Integration with Feature Store Frameworks

Feast + Iceberg

Feast (the most popular open-source feature store) can use Iceberg tables as its offline store by putting an Iceberg-aware engine underneath it, for example its Spark offline store with an Iceberg catalog configured. A sketch of a feature_store.yaml under that assumption (catalog name, URI, and connection details are placeholders):

project: feature_repo
registry: data/registry.db
provider: local
offline_store:
  type: spark
  spark_conf:
    spark.sql.catalog.my_catalog: org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.my_catalog.type: rest
    spark.sql.catalog.my_catalog.uri: https://my-catalog.example.com
online_store:
  type: redis
  connection_string: localhost:6379

With Feast + Iceberg, feature definitions live in code while the feature data lives in open, versioned tables: point-in-time training joins run against Iceberg snapshots, and materialization to the online store reads from the same governed source.

Tecton + Iceberg

Tecton (an enterprise feature platform) supports Iceberg as an offline store format, keeping materialized feature data in open tables that external engines and governance tools can read directly.

Feature Table Design Patterns

SCD Type 2 Feature History

Store all historical versions of feature values with effective dates:

CREATE TABLE features.user_features (
    user_id           BIGINT NOT NULL,
    feature_date      DATE NOT NULL,
    days_since_signup INT,
    lifetime_orders   INT,
    lifetime_revenue  DOUBLE,
    segment           STRING,
    computed_at       TIMESTAMP
) USING iceberg
PARTITIONED BY (months(feature_date));

-- In Spark SQL the write sort order is set separately:
ALTER TABLE features.user_features WRITE ORDERED BY user_id, feature_date;
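The payoff of keeping full history is that point-in-time retrieval becomes a filter plus a latest-per-key step. A standard-library sketch of that lookup over hypothetical rows shaped like the history table above:

```python
from datetime import date

# Hypothetical rows from features.user_features (one row per user per day).
rows = [
    {"user_id": 1, "feature_date": date(2026, 3, 1), "lifetime_orders": 10},
    {"user_id": 1, "feature_date": date(2026, 3, 8), "lifetime_orders": 12},
    {"user_id": 2, "feature_date": date(2026, 3, 8), "lifetime_orders": 3},
]

def features_as_of_date(rows, as_of):
    """Latest row per user with feature_date <= as_of."""
    latest = {}
    for row in sorted(rows, key=lambda r: r["feature_date"]):
        if row["feature_date"] <= as_of:
            latest[row["user_id"]] = row  # later dates overwrite earlier ones
    return latest

result = features_as_of_date(rows, date(2026, 3, 5))
assert result[1]["lifetime_orders"] == 10  # the March 8 row is invisible on March 5
assert 2 not in result                     # user 2 had no features yet
```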

Latest Features Table (SCD Type 1)

Store only the most recent feature values, updated via upsert:

-- Daily feature refresh via MERGE INTO
MERGE INTO features.user_features_latest AS target
USING daily_computed_features AS source
ON target.user_id = source.user_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
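The MERGE above is plain upsert semantics: matched keys are updated, unmatched keys are inserted, untouched keys survive. A toy in-memory equivalent:

```python
# target mirrors user_features_latest; source mirrors daily_computed_features.
target = {1: {"lifetime_orders": 10}, 2: {"lifetime_orders": 3}}
source = {1: {"lifetime_orders": 12}, 3: {"lifetime_orders": 1}}

def merge_upsert(target, source):
    """WHEN MATCHED THEN UPDATE; WHEN NOT MATCHED THEN INSERT."""
    for user_id, features in source.items():
        target[user_id] = features
    return target

merge_upsert(target, source)
assert target[1] == {"lifetime_orders": 12}  # updated
assert target[3] == {"lifetime_orders": 1}   # inserted
assert target[2] == {"lifetime_orders": 3}   # untouched
```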

Dremio and the Feature Store

Dremio’s Agentic Lakehouse complements the Iceberg feature store.

The combination of Iceberg + Dremio turns the offline feature store into an AI-accessible, semantically-rich data resource — not just a raw storage layer for ML pipelines.

📚 Go Deeper on Apache Iceberg

Alex Merced has authored three hands-on books covering Apache Iceberg, the Agentic Lakehouse, and modern data architecture. Pick up a copy to master the full ecosystem.
