# Open Table Format Comparison
The open table format landscape has four primary contenders in 2025: Apache Iceberg, Delta Lake, Apache Hudi, and Apache Paimon. Each brings a distinct design philosophy, set of strengths, and primary use case. This page provides a comprehensive, vendor-neutral comparison to inform architectural decisions.
## The Four Formats at a Glance
| Format | Origin | Governance | Primary Design Goal |
|---|---|---|---|
| Apache Iceberg | Netflix (2017) | Apache Software Foundation | Multi-engine interoperability |
| Delta Lake | Databricks (2019) | Linux Foundation | Reliable data lake on Spark |
| Apache Hudi | Uber (2016) | Apache Software Foundation | Streaming upserts + incremental processing |
| Apache Paimon | Alibaba/Flink (2022) | Apache Software Foundation | Streaming lakehouse (LSM-tree) |
## Core Architecture
| Architecture | Iceberg | Delta Lake | Hudi | Paimon |
|---|---|---|---|---|
| Metadata format | JSON + Avro manifests | JSON transaction log | Avro timeline | Manifest + changelog |
| Data format | Parquet, ORC, Avro | Parquet | Parquet | ORC, Parquet |
| Snapshot model | Immutable snapshots | Log-based versions | Timeline commits | Snapshots + changelog |
| Native storage | Object storage | Object storage | Object / HDFS | Object storage |
| Index built-in | Bloom filter (Puffin) | Bloom filter, Z-order | Bloom, HBase, Bucket | LSM-based indexes |
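The snapshot-model row above captures the deepest architectural split: Iceberg-style immutable snapshots each describe a complete table state, while Delta-style log-based versioning reconstructs state by replaying an ordered log of actions. A minimal sketch (all names here are illustrative, not real library APIs):

```python
# Toy contrast of the two snapshot models from the table above.
# Iceberg-style: each snapshot is an immutable, complete file listing.
# Delta-style: state is rebuilt by replaying add/remove actions in the log.
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:                      # Iceberg-style: self-contained state
    snapshot_id: int
    data_files: tuple                # full file listing at this point in time


def iceberg_state(snapshots, snapshot_id):
    # Time travel is a direct lookup: one snapshot = one full table state.
    return next(s.data_files for s in snapshots if s.snapshot_id == snapshot_id)


def delta_state(log, version):
    # Replay the transaction log's add/remove actions up to `version`.
    files = set()
    for v, action, path in log:
        if v > version:
            break
        files.add(path) if action == "add" else files.discard(path)
    return files


snapshots = [Snapshot(1, ("a.parquet",)), Snapshot(2, ("a.parquet", "b.parquet"))]
log = [(1, "add", "a.parquet"), (2, "add", "b.parquet"), (3, "remove", "a.parquet")]

print(iceberg_state(snapshots, 2))   # ('a.parquet', 'b.parquet')
print(delta_state(log, 3))           # {'b.parquet'}
```

Both real formats add manifests, statistics, and checkpointing on top of these ideas; the trade-off is lookup cost (snapshots) versus write amplification (log replay).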
## Multi-Engine Support
This is where the formats diverge most significantly:
| Engine | Iceberg | Delta Lake | Hudi | Paimon |
|---|---|---|---|---|
| Apache Spark | ✅ Full | ✅ Native (best) | ✅ Full | ✅ Full |
| Apache Flink | ✅ Full | ✅ (limited write) | ✅ Full | ✅ Native (best) |
| Apache Trino | ✅ Full | ✅ Good | ✅ Connector | 🚧 Limited |
| Dremio | ✅ Native | 🔄 Delta connector | 🔄 Limited | ❌ |
| BigQuery | ✅ BigLake | ❌ | ❌ | ❌ |
| Athena | ✅ Full | ✅ Full | ✅ Full | ❌ |
| Snowflake | ✅ Open Catalog | ❌ | ❌ | ❌ |
| DuckDB | ✅ Extension | ❌ | ❌ | ❌ |
| PyIceberg | ✅ Native | ❌ | ❌ | ❌ |
| StarRocks / Doris | ✅ Full | ✅ Full | ✅ Full | 🚧 Limited |
Verdict: Iceberg has the broadest multi-engine read/write support. Delta Lake has the best Spark/Databricks integration. Hudi is strong in Spark + Flink. Paimon is Flink-native with growing Spark support.
## Streaming and Real-Time
| Capability | Iceberg | Delta Lake | Hudi | Paimon |
|---|---|---|---|---|
| Streaming reads | ✅ Snapshot-based | ✅ CDC stream | ✅ Incremental pull | ✅ Native |
| Streaming writes | ✅ Flink sink | ✅ Spark Streaming, DLT | ✅ Flink, Spark | ✅ Flink native |
| Native CDC | 🔄 Via Flink | 🔄 Via Spark | ✅ Built-in | ✅ Built-in |
| Key-based upserts | ✅ Merge-on-read equality deletes | ✅ Delta merge | ✅ Native (indexed) | ✅ LSM-native |
| Incremental query | ✅ Snapshot diff | ✅ CDF | ✅ Native incremental | ✅ Changelog |
Hudi and Paimon offer the strongest native streaming and incremental semantics. Iceberg handles streaming well through Flink, but its CDC story is less built-in. Delta Lake provides CDC via its Change Data Feed (CDF).
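The "incremental query via snapshot diff" approach in the Iceberg row can be sketched in a few lines: the changed data between two snapshots is the set difference of their data-file listings. This is a simplification for illustration; real engines also account for delete files and use manifest metadata to prune the comparison.

```python
# Toy model of incremental reads via snapshot diff: compare the data files
# referenced by two snapshots to find what was added and removed between them.
def snapshot_diff(old_files: set, new_files: set):
    """Return (added, removed) data files between two table snapshots."""
    return new_files - old_files, old_files - new_files


s1 = {"f1.parquet", "f2.parquet"}                 # earlier snapshot
s2 = {"f2.parquet", "f3.parquet", "f4.parquet"}   # later snapshot

added, removed = snapshot_diff(s1, s2)
print(sorted(added))    # ['f3.parquet', 'f4.parquet']
print(sorted(removed))  # ['f1.parquet']
```

An incremental consumer then reads only the added files, which is why snapshot-based formats can serve near-real-time pipelines without a dedicated changelog.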
## Catalog and Governance
| Catalog | Iceberg | Delta Lake | Hudi | Paimon |
|---|---|---|---|---|
| Open catalog spec | REST Catalog (standard) | Proprietary (Unity) | HMS / REST | HMS / REST |
| Credential vending | ✅ Full (Polaris) | 🔄 Unity only | ❌ | ❌ |
| Multi-engine RBAC | ✅ Polaris RBAC | ✅ Unity RBAC | ❌ | ❌ |
| Open-source catalog | Apache Polaris, Nessie | None (Unity proprietary) | None | None |
| Cloud managed catalog | S3 Tables, BigLake, Glue | Unity (Databricks Cloud) | Glue (limited) | None yet |
Iceberg has the most mature open, multi-engine catalog ecosystem. Delta Lake’s Unity Catalog is powerful but proprietary to Databricks.
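What makes the REST Catalog spec an interoperability standard is that every engine addresses namespaces and tables through the same HTTP resource layout. A hedged sketch of those paths (the base URL and names are invented; the route shapes follow the published REST Catalog spec, with the optional `{prefix}` segment omitted):

```python
# Illustrative helpers that build Iceberg REST Catalog resource paths.
# Any REST-speaking engine (Spark, Trino, PyIceberg, ...) issues requests
# against this same layout, which is what makes the catalog engine-agnostic.
def namespace_url(base: str, namespace: str) -> str:
    # List/inspect a namespace: GET {base}/v1/namespaces/{namespace}
    return f"{base}/v1/namespaces/{namespace}"


def table_url(base: str, namespace: str, table: str) -> str:
    # Load table metadata: GET {base}/v1/namespaces/{namespace}/tables/{table}
    return f"{base}/v1/namespaces/{namespace}/tables/{table}"


base = "https://catalog.example.com"  # hypothetical catalog endpoint
print(table_url(base, "analytics", "events"))
# https://catalog.example.com/v1/namespaces/analytics/tables/events
```

Credential vending (the Polaris row above) builds on this: the catalog response can include scoped storage credentials, so engines never hold long-lived object-store keys.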
## Apache Paimon: The Emerging Contender
Apache Paimon (graduated to top-level Apache project in 2024) is the newest entry, originally designed as the “Flink Table Store”:
- LSM-tree (Log-Structured Merge-tree) architecture: Like a database, Paimon uses LSM trees for efficient key-based upserts and point lookups.
- Native Flink integration: Best-in-class Flink write performance.
- Changelog mode: Produces both data and change records simultaneously.
- Lookup joins: Efficient streaming lookup joins against Paimon tables.
Paimon is particularly well-suited for streaming + real-time query scenarios where you need both sub-second streaming writes and efficient point lookups.
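The LSM-tree point above is the key to Paimon's upsert performance: writes land in sorted runs, and duplicate keys across runs collapse at compaction time with last-write-wins semantics. A toy model of that merge step (not Paimon's actual code):

```python
# Toy LSM compaction: merge sorted runs of (key, value) records where
# later runs overwrite earlier ones on duplicate keys. This is why
# key-based upserts are cheap: an upsert is just a new record in a new run.
def compact(*runs):
    """Merge runs oldest-first; newer runs win on duplicate keys."""
    merged = {}
    for run in runs:                 # iterate oldest -> newest
        for key, value in run:
            merged[key] = value      # last write wins
    return sorted(merged.items())    # compacted output stays key-sorted


run1 = [("k1", "v1"), ("k2", "v2")]       # older flush
run2 = [("k2", "v2b"), ("k3", "v3")]      # newer flush: upserts k2
print(compact(run1, run2))
# [('k1', 'v1'), ('k2', 'v2b'), ('k3', 'v3')]
```

The same sorted-run structure is what enables efficient point lookups and lookup joins: a key probe only needs to consult a bounded number of runs rather than scan the table.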
## Decision Framework
| Requirement | Recommended Format |
|---|---|
| Maximum multi-engine portability | Apache Iceberg |
| All-in Databricks ecosystem | Delta Lake |
| High-frequency key-based upserts (Spark) | Apache Hudi |
| Flink-native streaming + lookups | Apache Paimon |
| AI analytics + semantic layer | Apache Iceberg (Dremio) |
| Open catalog governance | Apache Iceberg (Polaris) |
| Cloud-managed (AWS) | Apache Iceberg (S3 Tables) |
| Cloud-managed (GCP) | Apache Iceberg (BigLake) |
## The Industry Direction
The industry has broadly converged on Apache Iceberg as the interoperability standard:
- All cloud providers (AWS, GCP, Azure) have launched native managed Iceberg services.
- Snowflake (competitor to Databricks) co-created Apache Polaris with Dremio.
- Databricks itself added UniForm (Delta → Iceberg compatibility), acknowledging Iceberg’s ecosystem reach.
- The Iceberg REST Catalog specification is the emerging de facto multi-engine catalog API.
For new lakehouse projects in 2025, Apache Iceberg is the default choice for teams that want maximum optionality, open governance, and the broadest engine ecosystem. Other formats are viable in specific, well-defined contexts.