Apache Iceberg vs Apache Hudi
Apache Iceberg and Apache Hudi (Hadoop Upserts Deletes and Incrementals) are both Apache Software Foundation-governed open table formats that solve the “mutable data in object storage” problem for data lakehouses. They approach the problem from different angles, reflecting their different origins and primary use cases.
Origins and Design Philosophy
| Aspect | Apache Iceberg | Apache Hudi |
|---|---|---|
| Created by | Netflix (2017) | Uber (2016) |
| Entered Apache incubation | 2018 | 2019 |
| Primary design goal | Multi-engine interoperable table format | Streaming upsert / incremental processing on Hadoop |
| Governance | Apache Software Foundation | Apache Software Foundation |
| Primary home | Broad cloud/engine ecosystem | Spark + Hadoop ecosystem |
Hudi was born at Uber to solve a specific operational problem: efficiently updating ride-sharing data in HDFS (later S3) without full partition rewrites. Iceberg was born at Netflix to solve a different problem: table format fragility, hidden partitioning complexity, and lack of atomic semantics in the Hive ecosystem.
Architecture Comparison
Table Types
Hudi has two native table types that correspond to write optimization strategies:
- Copy-on-Write (CoW): On each write, affected files are rewritten with updates applied. Reads are always clean (no merge needed), but writes are expensive.
- Merge-on-Read (MoR): Updates are stored in delta log files. Reads merge base files with deltas. Fast writes, more complex reads. Requires periodic compaction.
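The trade-off can be sketched in plain Python. This is a toy model of the two write strategies, not Hudi's actual file layout; all names and structures are illustrative:

```python
# Toy model of the CoW / MoR write-read trade-off.
# Dicts stand in for data files; this is not Hudi's real storage layout.

def cow_write(base_file: dict, updates: dict) -> dict:
    """Copy-on-Write: rewrite the whole base file with updates applied."""
    new_file = dict(base_file)  # expensive: every record is rewritten
    new_file.update(updates)
    return new_file             # cheap reads: no merge needed afterwards

def mor_write(delta_log: list, updates: dict) -> list:
    """Merge-on-Read: append updates to a delta log (cheap write)."""
    return delta_log + [updates]

def mor_read(base_file: dict, delta_log: list) -> dict:
    """MoR reads merge the base file with deltas in commit order."""
    merged = dict(base_file)
    for delta in delta_log:
        merged.update(delta)
    return merged

base = {"k1": "a", "k2": "b"}
# Both strategies converge on the same logical table state; they differ in
# whether the merge cost is paid at write time (CoW) or read time (MoR).
assert cow_write(base, {"k2": "B"}) == mor_read(base, mor_write([], {"k2": "B"}))
```

This is why MoR tables need periodic compaction: folding the delta log back into base files bounds the read-time merge cost.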
Iceberg also supports CoW and MoR, but as per-operation write modes (configured via table properties such as write.delete.mode, write.update.mode, and write.merge.mode) rather than two distinct table types. A single, unified table abstraction covers both.
Transaction Log / Metadata
Hudi: Uses a timeline stored in a .hoodie/ directory. Each commit, clean, compaction, and rollback is an action on the timeline. The timeline is Hudi-specific and requires the Hudi library to interpret.
Iceberg: Uses a snapshot-based metadata tree (metadata JSON → manifest list → manifests → data files). The metadata is self-describing and structured — any client that can read the spec can navigate it.
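The tree structure can be modeled in a few lines of Python. Dicts stand in for the real JSON metadata and Avro manifest files, and all field names are simplified for the sketch:

```python
# Toy walk of Iceberg's metadata tree:
# metadata JSON -> manifest list -> manifests -> data files.
# Dicts stand in for the real JSON/Avro files; field names are simplified.

metadata = {
    "current-snapshot-id": 2,
    "snapshots": {
        1: {"manifest-list": ["m1"]},
        2: {"manifest-list": ["m1", "m2"]},
    },
}
manifests = {
    "m1": ["f1.parquet"],
    "m2": ["f2.parquet"],
}

def data_files(meta: dict, manifest_store: dict, snapshot_id=None) -> list:
    """Resolve a snapshot (current by default) to its data files."""
    sid = meta["current-snapshot-id"] if snapshot_id is None else snapshot_id
    snapshot = meta["snapshots"][sid]
    return [f for m in snapshot["manifest-list"] for f in manifest_store[m]]

assert data_files(metadata, manifests) == ["f1.parquet", "f2.parquet"]
# Time travel falls out of the same structure: resolve an older snapshot id.
assert data_files(metadata, manifests, snapshot_id=1) == ["f1.parquet"]
```

Because every level of the tree is an immutable, self-describing file, any client that implements the spec can perform this walk without a Hudi-style format-specific library.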
Indexing
Hudi has built-in, native indexing support:
- Bloom filter index (bloom filters stored per base file, probed at write time).
- Simple index (file-system-based, O(n) key lookups).
- HBase index (external HBase lookup for global key tracking).
- Bucket index (hash-based, deterministic file placement).
Hudi’s indexing makes it extremely efficient for key-based upserts — given a set of record keys, Hudi can determine which files contain those keys without a full scan.
Iceberg: File-level statistics in manifests, plus optional Parquet bloom filters and Puffin statistics files. Iceberg relies on query engines and compaction for clustering rather than native record-key indexing.
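The idea behind bloom-filter-based file pruning can be shown with a toy implementation. This is a simplified illustration of the concept, not Hudi's or Iceberg's actual index code; the class and file names are invented:

```python
import hashlib

class TinyBloom:
    """Toy bloom filter: false positives possible, false negatives never."""

    def __init__(self, bits: int = 1024, hashes: int = 3):
        self.bits, self.hashes = bits, hashes
        self.array = 0  # bit array packed into a Python int

    def _positions(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.array |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all(self.array >> pos & 1 for pos in self._positions(key))

def candidate_files(file_filters: dict, key: str) -> list:
    """Upsert planning: only files whose filter accepts the key need reading."""
    return [f for f, bloom in file_filters.items() if bloom.might_contain(key)]

filters = {"orders-001.parquet": TinyBloom(), "orders-002.parquet": TinyBloom()}
filters["orders-001.parquet"].add("order-42")
filters["orders-002.parquet"].add("order-99")
```

Given a batch of record keys to upsert, this kind of per-file filter lets the writer skip most files outright, which is the core of why Hudi's key-based upserts avoid full scans.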
Feature Comparison
| Feature | Apache Iceberg | Apache Hudi |
|---|---|---|
| Time travel | Yes (snapshot-based) | Yes (timeline-based) |
| Schema evolution | Full (column IDs) | Full |
| Incremental reads | Yes (snapshot diff) | Excellent (native incremental query) |
| Streaming upserts | Via MoR + Flink/Spark | Native (core design goal) |
| Multi-engine reads | Excellent (REST Catalog) | Good (but Hudi-specific connector needed) |
| Multi-engine writes | Excellent | Spark-primary |
| Record-key indexing | Via bloom filters | Native (multiple index types) |
| Partition evolution | Yes | Limited |
| Hidden partitioning | Yes | No |
| Open catalog spec | REST Catalog standard | Hive Metastore / registry |
| Python client | PyIceberg (mature) | Limited |
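Hidden partitioning deserves a brief illustration: Iceberg derives partition values from source columns via declared transforms (such as days(ts)), so users never write to or filter on a separate partition column. The sketch below mimics the behavior of the days transform in plain Python; everything else in it is invented for illustration:

```python
from datetime import datetime, timezone

# Sketch of Iceberg-style hidden partitioning: the partition value is derived
# from a source column by a transform declared in the table spec, so readers
# and writers never touch a partition column directly.

def days_transform(ts: datetime) -> int:
    """Epoch days derived from a timestamp, mimicking Iceberg's days(ts)."""
    return int(ts.timestamp() // 86400)

def partition_for(row: dict, transform, column: str) -> int:
    """The table spec, not the writer, maps the source column to a partition."""
    return transform(row[column])

row = {"order_id": 7, "ts": datetime(2026, 5, 14, 12, 0, tzinfo=timezone.utc)}
partition = partition_for(row, days_transform, "ts")
# A predicate on `ts` can be rewritten into a predicate on these derived
# partition values, which is how Iceberg prunes partitions "invisibly".
```

This is also what makes partition evolution tractable in Iceberg: because the transform lives in table metadata rather than in the data itself, it can change without rewriting existing files.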
Incremental Processing: Hudi’s Strength
Hudi’s native incremental query mode is more powerful than Iceberg’s snapshot-diff approach for certain streaming use cases:
Hudi incremental query: Query only records changed since a specific commit, including which records were inserted, updated, or deleted with their exact keys.
```python
# Hudi: native incremental read
spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", "20260514000000") \
    .load("s3://bucket/orders/")
```
Iceberg incremental read: a file-level diff that identifies which data files were added or removed between snapshots, but not which individual records changed or their exact keys; engines must read the changed files to recover that detail.
For streaming CDC pipelines where you need to know precisely which keys changed (not just which files), Hudi’s native incremental semantics can be more precise.
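The contrast can be modeled simply. In the toy Python below, a snapshot diff yields changed files while a Hudi-style incremental result carries record keys and operations directly; all structures are illustrative:

```python
# Toy contrast: file-level snapshot diff (Iceberg-style) vs record-level
# incremental result (Hudi-style). Structures are illustrative only.

snapshot_1 = {"files": {"f1.parquet", "f2.parquet"}}
snapshot_2 = {"files": {"f1.parquet", "f3.parquet"}}

def snapshot_diff(old: dict, new: dict) -> dict:
    """Which data files were added or removed between two snapshots."""
    return {
        "added": new["files"] - old["files"],
        "removed": old["files"] - new["files"],
    }

diff = snapshot_diff(snapshot_1, snapshot_2)
# The diff names changed *files*; a reader still has to open f3.parquet to
# learn which records changed. A Hudi incremental result names the keys:
hudi_incremental = [
    {"key": "order-42", "op": "update"},
    {"key": "order-99", "op": "insert"},
]
```

For a CDC consumer, the second shape is directly usable; the first requires an extra read-and-compare step per changed file.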
Multi-Engine: Iceberg’s Strength
Iceberg’s REST Catalog specification enables any engine to read and write Iceberg tables with full catalog services (discovery, access control, credential vending). Hudi’s ecosystem is more Spark-centric:
- Dremio: Full Iceberg support (native). Hudi: external table support.
- Trino: Full Iceberg support. Hudi: connector available but less mature.
- DuckDB: Full Iceberg support. Hudi: limited.
- PyIceberg: Full Python client for Iceberg. Hudi: no comparably mature native Python library.
- Flink: Both supported natively.
When to Choose Apache Iceberg
- Multi-engine architecture: Any team using more than one query engine.
- Open governance priority: Teams valuing Apache Foundation neutrality.
- AI/agent analytics: Dremio’s Agentic Lakehouse, MCP, semantic layer.
- Cloud-native deployment: AWS (S3 Tables, Glue, Athena), GCP (BigLake), Azure (Fabric).
- General lakehouse use case: Batch ETL, BI, streaming, ML.
When Apache Hudi May Be Preferred
- Record-key-centric upsert pipelines: Frequent updates to specific record keys by primary key.
- Spark-primary environments: Mature, stable Hudi-Spark integration.
- Existing Hudi investment: Organizations with significant existing Hudi tables and pipelines.
- Incremental CDC pipelines: Where Hudi’s native incremental semantics provide cleaner change tracking.
The Current Industry Landscape
As of 2025, Apache Iceberg has the broadest multi-engine support and the most active cloud vendor adoption. Hudi remains strong in Spark-centric streaming upsert workloads, particularly in organizations where it was the original choice. The Hudi project has also been adding REST Catalog support and improving multi-engine compatibility in response to Iceberg's ecosystem momentum.
For new lakehouse deployments, Apache Iceberg is the safer default for maximum future optionality.