Governance & Security · Last updated: May 14, 2026

Iceberg Data Lineage

Iceberg data lineage is the ability to trace the origin, transformation history, and downstream consumption of data in Iceberg tables using snapshot metadata, schema evolution history, and catalog audit logs integrated with lineage platforms like OpenLineage and Apache Atlas.



Data lineage is the ability to trace where data came from, how it was transformed, and where it flows downstream — a critical capability for regulatory compliance (GDPR, CCPA, HIPAA), debugging data quality issues, understanding impact of schema changes, and building trust in analytical results.

Apache Iceberg’s metadata-rich architecture provides a strong foundation for lineage: snapshot history, schema evolution tracking, and catalog audit logs all contribute lineage information. Integrating these with dedicated lineage platforms provides a complete lineage graph.

Built-In Lineage Signals in Iceberg

Snapshot History

Every write creates a new snapshot with metadata:

-- View the full write history for a table
SELECT
    snapshot_id,
    committed_at,
    operation,
    summary['added-records'] AS records_added,
    summary['spark.app.id'] AS writing_app  -- summary keys vary by engine; Spark records its app ID
FROM db.orders.snapshots
ORDER BY committed_at;

Schema Evolution History

-- View metadata changes over time (Spark SQL metadata table)
SELECT * FROM db.orders.metadata_log_entries;

The metadata log records every new metadata file the table writes, with timestamps. Because each schema change — column additions, renames, type promotions — produces a new metadata file, the log tells you exactly when a column appeared, when it was renamed, and in what sequence.

Manifest File Metadata

Each manifest references the snapshot that created it, and each data file entry references its source commit. This creates a chain: data file → manifest → snapshot → operation → source query.
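The first links of that chain can be walked with the metadata tables themselves. As a sketch (Spark SQL syntax; `db.orders` is the running example), this maps each manifest to the snapshot and operation that added it:

```sql
-- Map each manifest back to the snapshot (and operation) that added it
SELECT
    m.path AS manifest_path,
    m.added_snapshot_id,
    s.operation,
    s.committed_at
FROM db.orders.manifests m
JOIN db.orders.snapshots s
  ON m.added_snapshot_id = s.snapshot_id
ORDER BY s.committed_at;
```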

OpenLineage Integration

OpenLineage is the open standard for collecting and representing data lineage events across diverse data tools. Spark, Flink, dbt, Airflow, and many other tools emit OpenLineage events natively.

When Iceberg tables are the inputs or outputs of OpenLineage-instrumented jobs, the emitted lineage events capture the input and output datasets, their schemas, the job that connected them, and the run that produced each read or write.

Enabling OpenLineage in Spark with Iceberg

# Spark: enable OpenLineage instrumentation
# (requires the openlineage-spark package on the Spark classpath)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener") \
    .config("spark.openlineage.transport.type", "http") \
    .config("spark.openlineage.transport.url", "https://my-marquez.example.com") \
    .config("spark.openlineage.namespace", "my_lakehouse") \
    .getOrCreate()

# All Iceberg reads/writes are now automatically tracked
spark.sql("""
    INSERT INTO db.fact_orders
    SELECT * FROM db.orders WHERE status = 'completed'
""")
# OpenLineage event emitted: orders → fact_orders, with snapshot IDs
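A trimmed sketch of the event that reaches the backend may help make this concrete. Field names follow the OpenLineage run-event spec; the job name, run ID, and timestamps below are illustrative:

```python
import json

# Simplified OpenLineage RunEvent, roughly as the Spark listener would
# emit it for the INSERT above. Values here are illustrative.
event = {
    "eventType": "COMPLETE",
    "eventTime": "2026-05-14T12:00:00Z",
    "producer": "https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark",
    "job": {"namespace": "my_lakehouse", "name": "insert_into_fact_orders"},
    "run": {"runId": "0195d1f3-5b3e-7c1a-9f00-000000000000"},
    "inputs": [{"namespace": "my_lakehouse", "name": "db.orders"}],
    "outputs": [{"namespace": "my_lakehouse", "name": "db.fact_orders"}],
}
print(json.dumps(event, indent=2))
```

Facets (schema, snapshot IDs, column lineage) attach to these datasets and runs in real events; they are omitted here for brevity.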

Viewing Lineage in Marquez (OpenLineage Backend)

Marquez is the reference OpenLineage server. After jobs run, you can query the lineage graph:

# Get the lineage graph for the fact_orders table
curl "http://marquez:5000/api/v1/lineage?nodeId=dataset:my_lakehouse:db.fact_orders"

The response shows all upstream and downstream datasets, the jobs that connected them, and the timestamps of each lineage event.
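The graph alternates dataset and job nodes, each carrying edges that reference other node IDs. A trimmed, illustrative sketch of the response shape (real responses include richer `data` facets per node):

```python
import json

# Trimmed sketch of a Marquez lineage response: a "graph" list of nodes,
# where inEdges/outEdges reference the IDs of neighboring nodes.
response = {
    "graph": [
        {
            "id": "dataset:my_lakehouse:db.orders",
            "type": "DATASET",
            "inEdges": [],
            "outEdges": [{"origin": "dataset:my_lakehouse:db.orders",
                          "destination": "job:my_lakehouse:insert_into_fact_orders"}],
        },
        {
            "id": "job:my_lakehouse:insert_into_fact_orders",
            "type": "JOB",
            "inEdges": [{"origin": "dataset:my_lakehouse:db.orders",
                         "destination": "job:my_lakehouse:insert_into_fact_orders"}],
            "outEdges": [{"origin": "job:my_lakehouse:insert_into_fact_orders",
                          "destination": "dataset:my_lakehouse:db.fact_orders"}],
        },
    ]
}
print(json.dumps(response, indent=2))
```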

Apache Atlas Integration

For organizations using Apache Atlas for enterprise data governance, Iceberg tables can be registered as Atlas entities, with Process entities representing the jobs that read and write them. Atlas then layers classifications, glossary terms, and access policies on top of that lineage graph.
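As a hypothetical sketch, a lineage edge can be expressed as a Process entity payload for the Atlas v2 REST API. The type names, qualified-name convention, and endpoint below are assumptions — real deployments often rely on Hive or Spark Atlas hooks instead:

```python
import json

# Hypothetical Atlas v2 Process entity linking db.orders to db.fact_orders.
# Type names and the qualified-name convention are illustrative assumptions.
ATLAS_ENDPOINT = "http://atlas:21000/api/atlas/v2/entity"  # assumed URL

process_entity = {
    "entity": {
        "typeName": "Process",
        "attributes": {
            "qualifiedName": "insert_into_fact_orders@my_lakehouse",
            "name": "insert_into_fact_orders",
            "inputs": [
                {"typeName": "DataSet",
                 "uniqueAttributes": {"qualifiedName": "db.orders@my_lakehouse"}}
            ],
            "outputs": [
                {"typeName": "DataSet",
                 "uniqueAttributes": {"qualifiedName": "db.fact_orders@my_lakehouse"}}
            ],
        },
    }
}

# POSTing this payload to ATLAS_ENDPOINT (with authentication) would create
# the lineage edge; only the payload is built here to keep the sketch runnable.
print(json.dumps(process_entity, indent=2))
```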

Impact Analysis: Schema Change Propagation

One of the most practical lineage use cases: “If I change the schema of orders, what downstream tables and reports will be affected?”

With a lineage graph populated from OpenLineage:

  1. Find all downstream datasets of db.orders in the lineage graph.
  2. Identify which downstream consumers use the column being changed.
  3. Notify downstream owners before making the change.
# Example: find all downstream consumers of db.orders
import requests

resp = requests.get(
    "http://marquez:5000/api/v1/lineage",
    params={"nodeId": "dataset:my_lakehouse:db.orders"},
)
resp.raise_for_status()

# Marquez returns a "graph" list of nodes; datasets and jobs alternate,
# so downstream tables sit two hops away: dataset -> job -> dataset.
nodes = {node["id"]: node for node in resp.json()["graph"]}

consuming_jobs = [
    edge["destination"]
    for edge in nodes["dataset:my_lakehouse:db.orders"]["outEdges"]
]
downstream_tables = [
    edge["destination"]
    for job_id in consuming_jobs
    for edge in nodes[job_id]["outEdges"]
]
print(f"Changing db.orders will affect: {downstream_tables}")

Lineage and Compliance

For GDPR, CCPA, and HIPAA compliance, lineage tells you which tables hold personal data, where that data has been copied or derived downstream, and which jobs processed it along the way.

Combined with Iceberg’s GDPR delete capabilities, lineage enables a complete compliance response: identify all affected tables via lineage, then apply row-level deletes across all of them.
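Putting those two steps together, a minimal sketch of the second half — turning the lineage-derived table list into row-level deletes. The table names and the `user_id` column are illustrative:

```python
# Sketch: one DELETE per affected table for a single data subject.
# Table names and the user_id column below are illustrative.
def erasure_statements(tables, subject_id):
    """Build the row-level DELETE statements for an erasure request."""
    return [
        f"DELETE FROM {table} WHERE user_id = '{subject_id}'"
        for table in tables
    ]

# Tables surfaced by the lineage graph for this request (illustrative)
affected_tables = ["db.orders", "db.fact_orders", "db.order_audit"]

for stmt in erasure_statements(affected_tables, "user-1234"):
    print(stmt)
```

In practice each statement would be executed through the query engine, and the lineage graph re-checked afterwards to confirm no downstream copy was missed.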

