Apache Iceberg vs Apache Hudi
Apache Iceberg and Apache Hudi (Hadoop Upserts Deletes and Incrementals) are both Apache Software Foundation-governed open table formats that solve the “mutable data in object storage” problem for data lakehouses. They approach the problem from different angles, reflecting their different origins and primary use cases.
Origins and Design Philosophy
| Aspect | Apache Iceberg | Apache Hudi |
|---|---|---|
| Created by | Netflix (2017) | Uber (2016) |
| Entered Apache incubation | 2018 | 2019 |
| Primary design goal | Multi-engine interoperable table format | Streaming upsert / incremental processing on Hadoop |
| Governance | Apache Software Foundation | Apache Software Foundation |
| Primary home | Broad cloud/engine ecosystem | Spark + Hadoop ecosystem |
Hudi was born at Uber to solve a specific operational problem: efficiently updating ride-sharing data in HDFS (later S3) without full partition rewrites. Iceberg was born at Netflix to solve a different problem: table format fragility, hidden partitioning complexity, and lack of atomic semantics in the Hive ecosystem.
Architecture Comparison
Table Types
Hudi has two native table types that correspond to write optimization strategies:
- Copy-on-Write (CoW): On each write, affected files are rewritten with updates applied. Reads are always clean (no merge needed), but writes are expensive.
- Merge-on-Read (MoR): Updates are stored in delta log files. Reads merge base files with deltas. Fast writes, more complex reads. Requires periodic compaction.
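The trade-off can be sketched in plain Python. This is a toy model of the two write strategies, not Hudi's actual file layout; all names and structures are illustrative:

```python
# Toy model of the CoW / MoR write-read trade-off.
# Dicts stand in for data files; this is not Hudi's real storage layout.

def cow_write(base_file: dict, updates: dict) -> dict:
    """Copy-on-Write: rewrite the whole base file with updates applied."""
    new_file = dict(base_file)  # expensive: every record is rewritten
    new_file.update(updates)
    return new_file             # cheap reads: no merge needed afterwards

def mor_write(delta_log: list, updates: dict) -> list:
    """Merge-on-Read: append updates to a delta log (cheap write)."""
    return delta_log + [updates]

def mor_read(base_file: dict, delta_log: list) -> dict:
    """MoR reads merge the base file with deltas in commit order."""
    merged = dict(base_file)
    for delta in delta_log:
        merged.update(delta)
    return merged

base = {"k1": "a", "k2": "b"}
# Both strategies converge on the same logical table state; they differ in
# whether the merge cost is paid at write time (CoW) or read time (MoR).
assert cow_write(base, {"k2": "B"}) == mor_read(base, mor_write([], {"k2": "B"}))
```

This is why MoR tables need periodic compaction: folding the delta log back into base files bounds the read-time merge cost.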
Iceberg also supports CoW and MoR, but as per-operation write modes (configured via table properties such as write.delete.mode, write.update.mode, and write.merge.mode) rather than two distinct table types. A single, unified table abstraction covers both.
Transaction Log / Metadata
Hudi: Uses a timeline stored in a .hoodie/ directory. Each commit, clean, compaction, and rollback is an action on the timeline. The timeline is Hudi-specific and requires the Hudi library to interpret.
Iceberg: Uses a snapshot-based metadata tree (metadata JSON → manifest list → manifests → data files). The metadata is self-describing and structured — any client that can read the spec can navigate it.
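The tree structure can be modeled in a few lines of Python. Dicts stand in for the real JSON metadata and Avro manifest files, and all field names are simplified for the sketch:

```python
# Toy walk of Iceberg's metadata tree:
# metadata JSON -> manifest list -> manifests -> data files.
# Dicts stand in for the real JSON/Avro files; field names are simplified.

metadata = {
    "current-snapshot-id": 2,
    "snapshots": {
        1: {"manifest-list": ["m1"]},
        2: {"manifest-list": ["m1", "m2"]},
    },
}
manifests = {
    "m1": ["f1.parquet"],
    "m2": ["f2.parquet"],
}

def data_files(meta: dict, manifest_store: dict, snapshot_id=None) -> list:
    """Resolve a snapshot (current by default) to its data files."""
    sid = meta["current-snapshot-id"] if snapshot_id is None else snapshot_id
    snapshot = meta["snapshots"][sid]
    return [f for m in snapshot["manifest-list"] for f in manifest_store[m]]

assert data_files(metadata, manifests) == ["f1.parquet", "f2.parquet"]
# Time travel falls out of the same structure: resolve an older snapshot id.
assert data_files(metadata, manifests, snapshot_id=1) == ["f1.parquet"]
```

Because every level of the tree is an immutable, self-describing file, any client that implements the spec can perform this walk without a Hudi-style format-specific library.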
Indexing
Hudi has built-in, native indexing support:
- Bloom filter index (bloom filters stored per base file, probed at write time).
- Simple index (file-system-based, O(n) key lookups).
- HBase index (external HBase lookup for global key tracking).
- Bucket index (hash-based, deterministic file placement).
Hudi’s indexing makes it extremely efficient for key-based upserts — given a set of record keys, Hudi can determine which files contain those keys without a full scan.
Iceberg: File-level statistics in manifests, plus optional Parquet bloom filters and Puffin statistics files. Iceberg relies on query engines and compaction for clustering rather than native record-key indexing.
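The idea behind bloom-filter-based file pruning can be shown with a toy implementation. This is a simplified illustration of the concept, not Hudi's or Iceberg's actual index code; the class and file names are invented:

```python
import hashlib

class TinyBloom:
    """Toy bloom filter: false positives possible, false negatives never."""

    def __init__(self, bits: int = 1024, hashes: int = 3):
        self.bits, self.hashes = bits, hashes
        self.array = 0  # bit array packed into a Python int

    def _positions(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.array |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all(self.array >> pos & 1 for pos in self._positions(key))

def candidate_files(file_filters: dict, key: str) -> list:
    """Upsert planning: only files whose filter accepts the key need reading."""
    return [f for f, bloom in file_filters.items() if bloom.might_contain(key)]

filters = {"orders-001.parquet": TinyBloom(), "orders-002.parquet": TinyBloom()}
filters["orders-001.parquet"].add("order-42")
filters["orders-002.parquet"].add("order-99")
```

Given a batch of record keys to upsert, this kind of per-file filter lets the writer skip most files outright, which is the core of why Hudi's key-based upserts avoid full scans.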
Feature Comparison
| Feature | Apache Iceberg | Apache Hudi |
|---|---|---|
| Time travel | Yes (snapshot-based) | Yes (timeline-based) |
| Schema evolution | Full (column IDs) | Full |
| Incremental reads | Yes (snapshot diff) | Excellent (native incremental query) |
| Streaming upserts | Via MoR + Flink/Spark | Native (core design goal) |
| Multi-engine reads | Excellent (REST Catalog) | Good (but Hudi-specific connector needed) |
| Multi-engine writes | Excellent | Spark-primary |
| Record-key indexing | Via bloom filters | Native (multiple index types) |
| Partition evolution | Yes | Limited |
| Hidden partitioning | Yes | No |
| Open catalog spec | REST Catalog standard | Hive Metastore / registry |
| Python client | PyIceberg (mature) | Limited |
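Hidden partitioning deserves a brief illustration: Iceberg derives partition values from source columns via declared transforms (such as days(ts)), so users never write to or filter on a separate partition column. The sketch below mimics the behavior of the days transform in plain Python; everything else in it is invented for illustration:

```python
from datetime import datetime, timezone

# Sketch of Iceberg-style hidden partitioning: the partition value is derived
# from a source column by a transform declared in the table spec, so readers
# and writers never touch a partition column directly.

def days_transform(ts: datetime) -> int:
    """Epoch days derived from a timestamp, mimicking Iceberg's days(ts)."""
    return int(ts.timestamp() // 86400)

def partition_for(row: dict, transform, column: str) -> int:
    """The table spec, not the writer, maps the source column to a partition."""
    return transform(row[column])

row = {"order_id": 7, "ts": datetime(2026, 5, 14, 12, 0, tzinfo=timezone.utc)}
partition = partition_for(row, days_transform, "ts")
# A predicate on `ts` can be rewritten into a predicate on these derived
# partition values, which is how Iceberg prunes partitions "invisibly".
```

This is also what makes partition evolution tractable in Iceberg: because the transform lives in table metadata rather than in the data itself, it can change without rewriting existing files.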
Incremental Processing: Hudi’s Strength
Hudi’s native incremental query mode is more powerful than Iceberg’s snapshot-diff approach for certain streaming use cases:
Hudi incremental query: Query only records changed since a specific commit, including which records were inserted, updated, or deleted with their exact keys.
```python
# Hudi: native incremental read
spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", "20260514000000") \
    .load("s3://bucket/orders/")
```
Iceberg incremental read: a file-level diff that identifies which data files were added or removed between snapshots, but not which individual records changed or their exact keys; engines must read the changed files to recover that detail.
For streaming CDC pipelines where you need to know precisely which keys changed (not just which files), Hudi’s native incremental semantics can be more precise.
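The contrast can be modeled simply. In the toy Python below, a snapshot diff yields changed files while a Hudi-style incremental result carries record keys and operations directly; all structures are illustrative:

```python
# Toy contrast: file-level snapshot diff (Iceberg-style) vs record-level
# incremental result (Hudi-style). Structures are illustrative only.

snapshot_1 = {"files": {"f1.parquet", "f2.parquet"}}
snapshot_2 = {"files": {"f1.parquet", "f3.parquet"}}

def snapshot_diff(old: dict, new: dict) -> dict:
    """Which data files were added or removed between two snapshots."""
    return {
        "added": new["files"] - old["files"],
        "removed": old["files"] - new["files"],
    }

diff = snapshot_diff(snapshot_1, snapshot_2)
# The diff names changed *files*; a reader still has to open f3.parquet to
# learn which records changed. A Hudi incremental result names the keys:
hudi_incremental = [
    {"key": "order-42", "op": "update"},
    {"key": "order-99", "op": "insert"},
]
```

For a CDC consumer, the second shape is directly usable; the first requires an extra read-and-compare step per changed file.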
Multi-Engine: Iceberg’s Strength
Iceberg’s REST Catalog specification enables any engine to read and write Iceberg tables with full catalog services (discovery, access control, credential vending). Hudi’s ecosystem is more Spark-centric:
- Dremio: Full Iceberg support (native). Hudi: external table support.
- Trino: Full Iceberg support. Hudi: connector available but less mature.
- DuckDB: Full Iceberg support. Hudi: limited.
- PyIceberg: Full Python client for Iceberg. Hudi: no comparably mature native Python library.
- Flink: Both supported natively.
When to Choose Apache Iceberg
- Multi-engine architecture: Any team using more than one query engine.
- Open governance priority: Teams valuing Apache Foundation neutrality.
- AI/agent analytics: Dremio’s Agentic Lakehouse, MCP, semantic layer.
- Cloud-native deployment: AWS (S3 Tables, Glue, Athena), GCP (BigLake), Azure (Fabric).
- General lakehouse use case: Batch ETL, BI, streaming, ML.
When Apache Hudi May Be Preferred
- Record-key-centric upsert pipelines: Frequent updates to specific record keys by primary key.
- Spark-primary environments: Mature, stable Hudi-Spark integration.
- Existing Hudi investment: Organizations with significant existing Hudi tables and pipelines.
- Incremental CDC pipelines: Where Hudi’s native incremental semantics provide cleaner change tracking.
The Current Industry Landscape
As of 2025, Apache Iceberg has the broadest multi-engine support and the most active cloud vendor adoption. Hudi remains strong in Spark-centric streaming upsert workloads, particularly in organizations where it was the original choice. The Hudi project has also been adding REST Catalog support and improving multi-engine compatibility in response to Iceberg's ecosystem momentum.
For new lakehouse deployments, Apache Iceberg is the safer default for maximum future optionality.