# Apache Iceberg vs Delta Lake
Apache Iceberg and Delta Lake are the two dominant open table formats for cloud data lakehouses, and choosing between them is one of the most common architectural questions in modern data engineering. The right choice depends on your existing ecosystem, governance requirements, and how important multi-engine portability is to your architecture.
## Origins and Governance
| Aspect | Apache Iceberg | Delta Lake |
|---|---|---|
| Created by | Netflix (2017) | Databricks (2019) |
| Governance | Apache Software Foundation | Linux Foundation Delta Lake Project |
| Open source | Yes (Apache 2.0) | Yes (Apache 2.0) |
| Specification | Open, community-governed spec | Controlled primarily by Databricks |
| Vendor neutrality | Fully vendor-neutral | Databricks-first (UniForm for interop) |
## Core Architecture Differences
### Transaction Log
Delta Lake: Stores a transaction log in a _delta_log/ directory as sequentially numbered JSON commit files (00000.json, 00001.json, …) plus periodic Parquet checkpoint files. The log is a flat sequence of operations that readers replay to reconstruct the current table state.
Apache Iceberg: Stores metadata as a hierarchy: metadata JSON → manifest list → manifest files → data files. Each snapshot is a complete, self-describing metadata tree with rich per-file statistics.
Iceberg’s approach enables:
- Richer file statistics: Per-file min/max, null counts stored in manifests (usable for skipping without reading data files).
- Faster metadata operations: Manifest-level partition elimination before reading any data.
- Cleaner snapshot semantics: Each snapshot is a complete, addressable state — no need to “replay” a log to find current state.
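As a minimal sketch of what that metadata tree looks like in practice, the snippet below walks snapshots and per-file statistics with PyIceberg; the catalog name, endpoint, and table identifier are hypothetical.

```python
# Sketch: inspecting Iceberg's snapshot/manifest metadata with PyIceberg.
# The catalog name, REST endpoint, and table identifier are hypothetical.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lakehouse", uri="http://localhost:8181")  # REST catalog
table = catalog.load_table("db.events")

# Every snapshot is a complete, addressable state; no log replay is needed.
for snap in table.snapshots():
    print(snap.snapshot_id, snap.timestamp_ms, snap.manifest_list)

# Per-file statistics (record counts, value bounds, null counts) live in the
# manifests and can be scanned without opening any data files.
for f in table.inspect.files().to_pylist():
    print(f["file_path"], f["record_count"])
```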
Delta Lake’s log approach is simpler to implement but can accumulate many JSON files in high-commit scenarios (mitigated by checkpointing).
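For comparison, a similar sketch against a Delta table's log using the deltalake (delta-rs) package; the table path is hypothetical.

```python
# Sketch: inspecting a Delta table's transaction log with the deltalake package.
# Reading table state replays _delta_log/ commits from the latest Parquet
# checkpoint forward; the table path is hypothetical.
from deltalake import DeltaTable

dt = DeltaTable("/data/warehouse/events")
print(dt.version())           # index of the latest commit JSON in _delta_log/
print(dt.history(limit=5))    # recent commits, reconstructed from the log
print(dt.files()[:5])         # current data files resolved from log + checkpoints
```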
### Catalog and Multi-Engine Support
Apache Iceberg: Built around a fully abstracted, engine-neutral catalog API. Any catalog implementation (Hive Metastore, AWS Glue, Nessie, Apache Polaris, JDBC) plugs into the same catalog interface, and multiple engines connect via the Iceberg REST Catalog specification — a published, open API standard.
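To illustrate the engine-neutral model, here is a hedged sketch of pointing Spark at an Iceberg REST catalog. It assumes the iceberg-spark-runtime package is on the classpath; the catalog name, endpoint, and warehouse path are hypothetical.

```python
# Sketch: registering an Iceberg REST catalog in Spark (PySpark). Any engine that
# speaks the REST Catalog spec (Trino, Flink, PyIceberg, ...) can point at the
# same endpoint. Catalog name, endpoint, and warehouse path are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-rest-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri", "http://localhost:8181")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3://bucket/warehouse")
    .getOrCreate()
)

spark.sql("SELECT * FROM lakehouse.db.events LIMIT 10").show()
```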
Delta Lake: Natively integrated with Databricks Unity Catalog. Multi-engine support was added later via UniForm, which has Delta tables automatically generate Iceberg metadata, but external engines are read-only on UniForm tables.
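On the Delta side, UniForm is enabled per table through table properties. The sketch below follows the property names in the Databricks UniForm documentation and reuses the Spark session from the previous sketch, here assumed to run on a Databricks Runtime; the table name is hypothetical.

```python
# Sketch: enabling UniForm on an existing Delta table so it also publishes Iceberg
# metadata that external engines can read (read-only). Table name is hypothetical.
spark.sql("""
    ALTER TABLE main.analytics.events SET TBLPROPERTIES (
        'delta.enableIcebergCompatV2'          = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```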
### Row-Level Deletes
Apache Iceberg: First-class, spec-level support for row-level deletes via:
- Positional delete files (Spec v2).
- Equality delete files (Spec v2).
- Deletion vectors (Spec v3 — compact, bitmap-based).
Delta Lake: Row-level deletes via deletion vectors (Delta 2.3+) stored as compact bitmaps. Functionally similar to Iceberg’s deletion-vector approach, but tracked through the Delta transaction log.
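As a small sketch of the Iceberg side, the snippet below issues a row-level DELETE through Spark SQL, reusing the Spark session and hypothetical table from earlier; with merge-on-read write modes, the engine records delete files rather than rewriting the affected data files.

```python
# Sketch: row-level DELETE on an Iceberg v2 table. With merge-on-read write modes,
# deletes are recorded as delete files (or deletion vectors under spec v3) instead
# of eagerly rewriting data files. Table and column names are hypothetical.
spark.sql("""
    ALTER TABLE lakehouse.db.events SET TBLPROPERTIES (
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    )
""")
spark.sql("DELETE FROM lakehouse.db.events WHERE user_id = 42")
```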
## Feature Comparison
| Feature | Apache Iceberg | Delta Lake |
|---|---|---|
| Time travel | Yes (snapshot-based) | Yes (version-based) |
| Schema evolution | Full (column IDs) | Full |
| Partition evolution | Yes (metadata-only) | Partial (requires rewrites) |
| Hidden partitioning | Yes | No |
| Multi-engine reads | Excellent (REST Catalog) | Good (UniForm — read-only) |
| Multi-engine writes | Excellent | Limited (Databricks primary writer) |
| Branching and tagging | Yes (table-level branches and tags) | No table-level support (catalog-level only, via Unity) |
| Open catalog standard | REST Catalog spec | Unity Catalog API (proprietary) |
| Streaming support | Yes (Flink, Spark Streaming) | Yes (Spark Streaming, DLT) |
| DML (UPDATE/DELETE/MERGE) | Full (Spec v2) | Full |
| Governance | Apache Software Foundation | Linux Foundation |
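To make the hidden partitioning, partition evolution, and time travel rows above concrete, here is a sketch against the same hypothetical Iceberg catalog, using the Spark session (with the Iceberg SQL extensions) configured earlier; the table name and snapshot id are hypothetical.

```python
# Sketch: hidden partitioning and metadata-only partition evolution in Iceberg.
spark.sql("""
    CREATE TABLE lakehouse.db.clicks (
        event_ts TIMESTAMP,
        user_id  BIGINT,
        url      STRING
    ) USING iceberg
    PARTITIONED BY (days(event_ts))  -- hidden: queries filter on event_ts itself
""")

# Evolve the partition spec without rewriting existing files: old files keep the
# daily spec, new writes use the hourly spec, and query planning handles both.
spark.sql("ALTER TABLE lakehouse.db.clicks ADD PARTITION FIELD hours(event_ts)")
spark.sql("ALTER TABLE lakehouse.db.clicks DROP PARTITION FIELD days(event_ts)")

# Time travel: query an earlier snapshot by id or by timestamp (values hypothetical).
spark.sql("SELECT count(*) FROM lakehouse.db.clicks VERSION AS OF 8159919086214917499").show()
spark.sql("SELECT count(*) FROM lakehouse.db.clicks TIMESTAMP AS OF '2024-06-01 00:00:00'").show()
```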
## Ecosystem Support
Apache Iceberg is supported natively by virtually every major query engine:
- Spark, Flink, Trino, Presto, Dremio, Athena, BigQuery, Snowflake, DuckDB, PyIceberg, StarRocks, Doris, Hive 4.x, and more.
- AWS, GCP, and Azure all offer managed Iceberg support (Amazon S3 Tables, BigLake managed tables, and Iceberg interoperability in Microsoft Fabric’s OneLake).
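As one small illustration of those multi-engine reads, the sketch below scans an Iceberg table from DuckDB's iceberg extension; the metadata path is hypothetical, and any object-store credentials are assumed to be configured separately.

```python
# Sketch: reading an Iceberg table from DuckDB, a completely different engine,
# by pointing iceberg_scan at the table's current metadata file (path hypothetical).
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg; LOAD iceberg;")
count = con.execute(
    "SELECT count(*) FROM iceberg_scan('/data/warehouse/db/clicks/metadata/v3.metadata.json')"
).fetchone()[0]
print(count)
```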
Delta Lake is supported by:
- Spark and the Databricks Runtime (best-in-class support).
- Trino, Flink, Redshift, BigQuery, Athena, and others via the Delta connector.
- Databricks Unity Catalog as primary governance.
Delta’s ecosystem is narrower than Iceberg’s, particularly outside the Spark/Databricks world.
## When to Choose Apache Iceberg
- You need true multi-engine writes (Spark + Flink + Dremio + Trino, all as first-class writers).
- You require open catalog governance (Apache Polaris, Nessie, Glue — all vendor-neutral).
- You are building on AWS, GCP, or Azure without Databricks as the center.
- You need AI agent and MCP integration via Dremio’s Agentic Lakehouse.
- You value governance and compliance backed by Apache Software Foundation ownership.
## When Delta Lake May Be Preferred
- You are all-in on Databricks and want the tightest platform integration.
- Your primary compute is Spark via Databricks Runtime.
- You rely on Databricks-specific features (Delta Live Tables, Photon, DBSQL).
- You want Unity Catalog as your governance platform within the Databricks world.
## The Industry Trajectory
Since 2023, the industry has clearly consolidated on Apache Iceberg as the multi-engine interoperability standard. Evidence:
- Snowflake co-created Apache Polaris with Dremio and launched Snowflake Open Catalog.
- AWS launched S3 Tables (native managed Iceberg).
- Google launched BigLake Managed Tables (native Iceberg).
- Databricks itself added UniForm (Delta → Iceberg compatibility) — acknowledging Iceberg’s ecosystem dominance.
Even Delta Lake’s vendor (Databricks) has invested in Iceberg compatibility, which signals where the ecosystem is heading.