Iceberg Data Lineage
Data lineage is the ability to trace where data came from, how it was transformed, and where it flows downstream — a critical capability for regulatory compliance (GDPR, CCPA, HIPAA), debugging data quality issues, understanding the impact of schema changes, and building trust in analytical results.
Apache Iceberg’s metadata-rich architecture provides a strong foundation for lineage: snapshot history, schema evolution tracking, and catalog audit logs all contribute lineage information. Integrating these with dedicated lineage platforms provides a complete lineage graph.
Built-In Lineage Signals in Iceberg
Snapshot History
Every write creates a new snapshot with metadata:
- committed_at: Exact timestamp of the write.
- operation: The type of operation (append, overwrite, delete, merge).
- summary: Details including the source query, number of records added/deleted, and partition counts.
- parent-snapshot-id: Links to the previous state, forming a complete audit chain.
-- View the full write history for a table
SELECT
  snapshot_id,
  committed_at,
  operation,
  summary['added-records'] AS records_added,
  summary['source-query'] AS source_query
FROM db.orders.snapshots
ORDER BY committed_at;
Schema Evolution History
-- View schema changes over time
SELECT * FROM db.orders.metadata_log;
The metadata log records every schema change — column additions, renames, type promotions — with timestamps. This tells you exactly when a column appeared, when it was renamed, and in what sequence.
Manifest File Metadata
Each manifest references the snapshot that created it, and each data file entry references its source commit. This creates a chain: data file → manifest → snapshot → operation → source query.
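The parent-snapshot-id links can be walked programmatically to reconstruct this audit chain. A minimal sketch in plain Python, where the snapshots dict is a hypothetical stand-in for rows queried from db.orders.snapshots:

```python
# Hypothetical snapshot metadata, keyed by snapshot_id, as it might be
# read from db.orders.snapshots (parent_id, operation).
snapshots = {
    100: {"parent_id": None, "operation": "append"},
    200: {"parent_id": 100, "operation": "overwrite"},
    300: {"parent_id": 200, "operation": "delete"},
}

def audit_chain(snapshot_id, snapshots):
    """Follow parent-snapshot-id links back to the first commit.

    Returns (snapshot_id, operation) pairs, oldest first.
    """
    chain = []
    current = snapshot_id
    while current is not None:
        entry = snapshots[current]
        chain.append((current, entry["operation"]))
        current = entry["parent_id"]
    return list(reversed(chain))

print(audit_chain(300, snapshots))
# → [(100, 'append'), (200, 'overwrite'), (300, 'delete')]
```

In a real deployment you would populate the dict from the snapshots metadata table; the traversal logic is the same.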
OpenLineage Integration
OpenLineage is the open standard for collecting and representing data lineage events across diverse data tools. Spark, Flink, dbt, Airflow, and many other tools emit OpenLineage events natively.
When Iceberg tables are the inputs or outputs of OpenLineage-instrumented jobs, lineage events capture:
- Inputs: Which Iceberg tables were read (by table name and snapshot ID).
- Outputs: Which Iceberg tables were written (by table name and new snapshot ID).
- Job metadata: The Spark/Flink/Airflow job that performed the transformation.
- Column-level lineage (for supported sources): Which input columns mapped to which output columns.
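To make the event contents concrete, here is a rough sketch of such an event as a Python dict. The top-level fields (eventType, run, job, inputs, outputs) follow the OpenLineage event model; the "version" facet carrying snapshot IDs is an illustrative assumption, not an exact facet name:

```python
# Sketch of an OpenLineage run event for a job that reads db.orders
# and writes db.fact_orders. Snapshot IDs and the facet layout are
# illustrative assumptions.
event = {
    "eventType": "COMPLETE",
    "eventTime": "2024-05-01T12:00:00Z",
    "run": {"runId": "00000000-0000-0000-0000-000000000001"},
    "job": {"namespace": "my_lakehouse", "name": "build_fact_orders"},
    "inputs": [
        {
            "namespace": "my_lakehouse",
            "name": "db.orders",
            "facets": {"version": {"snapshotId": 100}},  # assumed facet
        }
    ],
    "outputs": [
        {
            "namespace": "my_lakehouse",
            "name": "db.fact_orders",
            "facets": {"version": {"snapshotId": 201}},  # assumed facet
        }
    ],
}

# The event ties input snapshot 100 to output snapshot 201 via the job
print(event["job"]["name"], "read", event["inputs"][0]["name"],
      "and wrote", event["outputs"][0]["name"])
```

Because events carry snapshot IDs, lineage is version-aware: you can see not just that fact_orders derives from orders, but which exact state of orders produced it.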
Enabling OpenLineage in Spark with Iceberg
# Spark: enable OpenLineage instrumentation
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener") \
    .config("spark.openlineage.transport.type", "http") \
    .config("spark.openlineage.transport.url", "https://my-marquez.example.com") \
    .config("spark.openlineage.namespace", "my_lakehouse") \
    .getOrCreate()
# All Iceberg reads/writes are now automatically tracked
spark.sql("""
INSERT INTO db.fact_orders
SELECT * FROM db.orders WHERE status = 'completed'
""")
# OpenLineage event emitted: orders → fact_orders, with snapshot IDs
Viewing Lineage in Marquez (OpenLineage Backend)
Marquez is the reference OpenLineage server. After jobs run, you can query the lineage graph:
# Get the lineage graph for the fact_orders table
curl "http://marquez:5000/api/v1/lineage?nodeId=dataset:my_lakehouse:db.fact_orders"
The response shows all upstream and downstream datasets, the jobs that connected them, and the timestamps of each lineage event.
Apache Atlas Integration
For organizations using Apache Atlas for enterprise data governance:
- Atlas hooks for Spark (spark-atlas-connector) capture Iceberg table reads/writes as Atlas lineage events.
- Atlas’s lineage graph visualization shows the full upstream/downstream dependency chain.
- Atlas classifications (sensitivity labels) can be propagated: if a source Iceberg table is classified as PII, that classification automatically propagates to derived tables.
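The classification-propagation behavior above amounts to a graph traversal. A minimal sketch in plain Python, where the edges dict and table names are hypothetical stand-ins for what Atlas maintains internally:

```python
from collections import deque

# Hypothetical lineage edges: source table -> tables derived from it
edges = {
    "db.customers": ["db.orders_enriched"],
    "db.orders_enriched": ["db.fact_orders", "db.marketing_export"],
}

def propagate_classification(source, edges):
    """Return every downstream table that inherits the source's label."""
    labeled, queue = set(), deque([source])
    while queue:
        table = queue.popleft()
        for downstream in edges.get(table, []):
            if downstream not in labeled:
                labeled.add(downstream)
                queue.append(downstream)
    return labeled

# A PII label on db.customers reaches every derived table
print(propagate_classification("db.customers", edges))
```

The traversal is transitive: a table two or more hops downstream still inherits the label, which is exactly why lineage completeness matters for governance.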
Impact Analysis: Schema Change Propagation
One of the most practical lineage use cases: “If I change the schema of orders, what downstream tables and reports will be affected?”
With a lineage graph populated from OpenLineage:
- Find all downstream datasets of db.orders in the lineage graph.
- Identify which downstream consumers use the column being changed.
- Notify downstream owners before making the change.
# Example: find all downstream consumers of db.orders
import requests

node_id = "dataset:my_lakehouse:db.orders"
lineage = requests.get(
    "http://marquez:5000/api/v1/lineage",
    params={"nodeId": node_id},
).json()

# Marquez returns the graph as a list of nodes, each with inEdges and
# outEdges referencing other node IDs. Direct successors of a dataset
# are the jobs that read it; their output datasets are one hop further.
downstream = [
    edge["destination"]
    for node in lineage["graph"]
    if node["id"] == node_id
    for edge in node["outEdges"]
]
print(f"Changing db.orders will affect: {downstream}")
Lineage and Compliance
For GDPR, CCPA, and HIPAA compliance, lineage tells you:
- Where does PII data originate (which Iceberg source tables)?
- Which downstream tables derive from those sources?
- If a user requests erasure, which derived tables also need to be updated?
Combined with Iceberg’s GDPR delete capabilities, lineage enables a complete compliance response: identify all affected tables via lineage, then apply row-level deletes across all of them.
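That two-step response can be sketched as: resolve the affected tables from the lineage graph, then issue a row-level delete against each. The table list and key column below are illustrative; in production each statement would be parameterized and executed via spark.sql(), with Iceberg recording every delete as a new, auditable snapshot:

```python
def erasure_statements(user_id, affected_tables, key_column="customer_id"):
    """Generate one row-level DELETE per table holding the user's data.

    Illustrative only: real code would use parameterized execution
    rather than string interpolation.
    """
    return [
        f"DELETE FROM {table} WHERE {key_column} = '{user_id}'"
        for table in affected_tables
    ]

# Affected tables as resolved from the lineage graph (hypothetical)
tables = ["db.orders", "db.fact_orders", "db.marketing_export"]
for stmt in erasure_statements("user-42", tables):
    print(stmt)
```

The lineage graph guarantees coverage: without it, a derived table missed during erasure would leave residual PII and an incomplete compliance response.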