The Ultimate Guide to Open Table Formats - Iceberg, Delta Lake, Hudi, Paimon, and DuckLake

Modern lakehouse stacks live or die by how they manage tables on cheap, scalable object storage. That “how” is the job of open table formats, the layer that turns piles of Parquet/ORC files into reliable, ACID-compliant tables with schema evolution, time travel, and efficient query planning. If you’ve ever wrestled with brittle Hive tables, small-file explosions, or “append-only” lakes that can’t handle updates and deletes, you already know why this layer matters.

In this guide, we’ll demystify the five formats you’re most likely to encounter:
- Apache Iceberg, the batch-first powerhouse
- Delta Lake, the transaction-log-driven format behind Databricks’ lakehouse
- Apache Hudi, the incremental pioneer
- Apache Paimon, streaming-first by design
- DuckLake, which reimagines metadata with SQL

We’ll start beginner-friendly, clarifying what a table format is and why it’s essential, then progressively dive into expert-level topics: metadata internals (snapshots, logs, manifests, LSM levels), row-level change strategies (COW, MOR, delete vectors), performance trade-offs, ecosystem support (Spark, Flink, Trino/Presto, DuckDB, warehouses), and adoption trends you should factor into your roadmap.

By the end, you’ll have a practical mental model to choose the right format for your workloads, whether you’re optimizing petabyte-scale analytics, enabling near-real-time CDC, or simplifying your metadata layer for developer velocity.

Why Open Table Formats Exist

Before diving into each format, it’s worth understanding why open table formats became necessary in the first place.

Traditional data lakes, built on raw files like CSV, JSON, or Parquet, were cheap and scalable but brittle. They had no concept of transactions, which meant that if two jobs wrote data at the same time, you could easily end up with partial or corrupted results. Schema evolution was painful: renaming or reordering columns could break queries, and updating or deleting even a single row often meant rewriting entire partitions.

Meanwhile, enterprises still needed database-like features (updates, deletes, versioning, auditing) on their data lakes. That tension set the stage for open table formats. These formats layer metadata and transaction protocols on top of files to give the data lake the brains of a database while keeping its open, flexible nature.

In practice, open table formats deliver several critical capabilities:
- ACID transactions, so concurrent writers can’t corrupt a table
- Schema evolution, so columns can be added, renamed, or reordered without breaking queries
- Row-level updates and deletes, instead of rewriting entire partitions
- Time travel and versioning for auditing and reproducibility
- Efficient query planning through file-level metadata and statistics, rather than slow directory listings

In other words, table formats solve the “wild west of files” problem, turning data lakes into lakehouses that balance scalability with structure. The differences among Iceberg, Delta, Hudi, Paimon, and DuckLake lie in how they achieve this and what trade-offs they make to optimize for batch, streaming, or simplicity.

Next, we’ll walk through the history and evolution of each format to see how these ideas took shape.

The Evolution of Open Table Formats

The journey of open table formats reflects the challenges companies faced as data lakes scaled from terabytes to petabytes. Each format emerged to solve specific pain points:
- Apache Hudi (Uber, 2016): keep Hadoop/Hive tables fresh with row-level upserts, deletes, and incremental ingestion
- Delta Lake (Databricks, 2017–2018): bring reliable ACID transactions to Spark jobs writing to cloud object storage
- Apache Iceberg (Netflix, 2018): replace brittle Hive tables with scalable metadata, safe schema evolution, and true atomicity
- Apache Paimon (Alibaba, 2022): make the data lake behave like an always-up-to-date materialized view for streaming workloads
- DuckLake (DuckDB/MotherDuck, 2025): simplify the metadata layer by moving it into a SQL database

These formats represent waves of innovation:
- The first wave (Hudi, Delta Lake, Iceberg) brought transactions, schema evolution, and row-level changes to batch-oriented lakes
- The second wave (Paimon) made streaming-first ingestion and sub-minute freshness the design center
- The newest wave (DuckLake) rethinks metadata management itself, trading file-based logs and manifests for a relational catalog

Next, we’ll dive into Apache Iceberg in detail, its metadata structure, features, and why it has become the default choice for many modern lakehouse deployments.

Apache Iceberg: The Batch-First Powerhouse

Background & Origins
Apache Iceberg was born at Netflix in 2018 and donated to the Apache Software Foundation in 2019. Its mission was clear: fix the long-standing problems of Hive tables, namely unreliable schema changes, expensive directory scans, and the lack of true atomicity. Iceberg introduced a clean-slate design that scaled to petabytes while guaranteeing ACID transactions, schema evolution, and time-travel queries.

Metadata Structure
Iceberg’s metadata model is built on a hierarchy of files:
- A table metadata file (JSON) that records the current schema, partition spec, and pointers to snapshots
- Snapshots, each representing the table state after a commit
- Manifest lists (Avro), one per snapshot, pointing to the manifests that make up that snapshot
- Manifest files, which list data files along with partition values and column-level statistics
- Data files, typically Parquet or ORC

This design avoids reliance on directory listings, making planning queries over millions of files feasible.
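
To make this hierarchy concrete, here is a minimal sketch using PySpark with the Iceberg Spark runtime on the classpath; the catalog name, warehouse path, and table (lake.db.events) are illustrative assumptions, not part of any standard setup.

```python
from pyspark.sql import SparkSession

# Minimal sketch: register an Iceberg catalog and peek at its metadata tables.
# Catalog name, warehouse path, and table name are assumptions for illustration.
spark = (
    SparkSession.builder
    .appName("iceberg-metadata-peek")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Iceberg exposes its metadata hierarchy as queryable metadata tables.
spark.sql("SELECT snapshot_id, committed_at, operation FROM lake.db.events.snapshots").show()
spark.sql("SELECT path, added_data_files_count FROM lake.db.events.manifests").show()

# Time travel: read the table as of an earlier snapshot (snapshot id is illustrative).
spark.read.option("snapshot-id", 1234567890).table("lake.db.events").show()
```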

Core Features
- ACID transactions with snapshot isolation
- Full schema evolution (add, drop, rename, reorder columns) without rewriting data
- Hidden partitioning and partition evolution, so queries don’t depend on physical layout
- Time travel and rollback to any retained snapshot
- Scalable metadata that avoids expensive directory listings

Row-Level Changes
Initially copy-on-write, Iceberg now also supports delete files for merge-on-read semantics. Deletes can be tracked separately and applied at read time, reducing write amplification for frequent updates. Background compaction later consolidates these into optimized Parquet files.
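
As a hedged example, reusing the SparkSession and table from the sketch above: the properties below opt the table into merge-on-read for row-level changes, and MERGE INTO (which requires the Iceberg SQL extensions to be enabled) performs upserts from an assumed `updates` view.

```python
# Assumes the `spark` session and `lake.db.events` table from the previous sketch,
# plus a temp view `updates` holding changed rows; all names are illustrative.
spark.sql("""
    ALTER TABLE lake.db.events SET TBLPROPERTIES (
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    )
""")

# Deletes now produce small delete files instead of rewriting whole Parquet files.
spark.sql("DELETE FROM lake.db.events WHERE event_date < '2023-01-01'")

# Upserts via MERGE INTO (requires spark.sql.extensions = IcebergSparkSessionExtensions).
spark.sql("""
    MERGE INTO lake.db.events t
    USING updates s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```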

Ecosystem & Adoption
Iceberg’s neutrality and technical strengths have driven broad adoption. It is supported in:
- Query engines such as Apache Spark, Apache Flink, Trino, Presto, Hive, and Dremio
- Cloud warehouses and services including Snowflake, Google BigQuery, and AWS Athena/Glue
- Lightweight engines like DuckDB via its Iceberg extension

By late 2024, Iceberg had become the de facto industry standard for open table formats, with adoption by Netflix, Apple, LinkedIn, Adobe, and major cloud vendors. Its community-driven governance and rapid innovation ensure it continues to evolve; recent features like row-level delete vectors and REST catalogs are making it even more capable.

Next, we’ll look at Delta Lake, the transaction-log–driven format that became the backbone of Databricks’ lakehouse vision.

Delta Lake: The Transaction-Log Backbone

Background & Origins
Delta Lake was introduced by Databricks around 2017–2018 to address Spark’s biggest gap: reliable transactions on cloud object storage. Open-sourced in 2019 under the Linux Foundation, Delta Lake became the backbone of Databricks’ lakehouse pitch, combining data warehouse reliability with the scalability of data lakes. Its design centered on a simple but powerful idea: use a transaction log to coordinate all changes.

Metadata Structure
At the core of every Delta table is the _delta_log directory:
- Numbered JSON commit files (00000000000000000000.json, 00000000000000000001.json, and so on), each recording the actions of one transaction: files added, files removed, and metadata or protocol changes
- Periodic Parquet checkpoint files that roll up the log, so readers don’t have to replay every commit from the beginning

This log-based design is simple and easy to reconstruct: replay JSON logs from the last checkpoint to reach the latest state.
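
For illustration, here is a small sketch with the `deltalake` (delta-rs) Python package, which reads the log directly; the table path and version number are assumptions.

```python
from deltalake import DeltaTable

# Open a Delta table by path; the state below is derived entirely from _delta_log.
dt = DeltaTable("s3://my-bucket/tables/orders")

print(dt.version())    # latest committed version
print(dt.files()[:5])  # data files currently "live" according to the log
print(dt.history(5))   # recent commits: operation, timestamp, parameters

# Time travel: load the table as of an earlier version (version number is illustrative).
old = DeltaTable("s3://my-bucket/tables/orders", version=3)
df = old.to_pandas()
```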

Core Features
- ACID transactions backed by the transaction log
- Schema enforcement and schema evolution
- Time travel to previous table versions
- Unified batch and streaming reads and writes
- DML support: UPDATE, DELETE, and MERGE INTO

Row-Level Changes
Delta primarily uses copy-on-write: updates and deletes rewrite entire Parquet files while marking old ones as removed in the log. This guarantees atomicity but can be expensive at scale. To mitigate this, Delta introduced deletion vectors (in newer releases), which track row deletions without rewriting whole files, bringing it closer to merge-on-read semantics. Upserts are supported via SQL MERGE INTO, commonly used for database change data capture workloads.
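
A hedged sketch of that upsert pattern with the delta-spark Python API, assuming an active SparkSession `spark` configured for Delta and a DataFrame of changed rows `updates_df`; the table path is illustrative.

```python
from delta.tables import DeltaTable

# Assumes an active SparkSession `spark` with delta-spark configured and a source
# DataFrame `updates_df`; the table path is illustrative.
target = DeltaTable.forPath(spark, "s3://my-bucket/tables/orders")

(target.alias("t")
    .merge(updates_df.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Opt in to deletion vectors (newer Delta releases) so deletes and updates
# no longer have to rewrite whole Parquet files.
spark.sql("""
    ALTER TABLE delta.`s3://my-bucket/tables/orders`
    SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")
```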

Ecosystem & Adoption
Delta Lake is strongest in the Spark ecosystem and is the default format in Databricks. It’s also supported by:
- Apache Spark (natively) and Apache Flink via the Delta connector
- Trino and Presto through their Delta Lake connectors
- Non-JVM tooling via delta-rs, with Python and Rust bindings
- Cloud platforms such as Microsoft Fabric, which standardizes on Delta for its OneLake storage

While its openness has improved since Delta 2.0, much of its adoption remains tied to Databricks. Still, Delta Lake is one of the most widely used formats in production, powering pipelines at thousands of organizations.

Next, we’ll explore Apache Hudi, the pioneer of incremental processing and near-real-time data lake ingestion.

Apache Hudi: The Incremental Pioneer

Background & Origins
Apache Hudi (short for Hadoop Upserts Deletes and Incrementals) was created at Uber in 2016 to solve a pressing challenge: keeping Hive tables up to date with fresh, continuously changing data. Uber needed a way to ingest ride updates, user changes, and event streams into their Hadoop data lake without waiting hours for batch jobs. Open-sourced in 2017 and donated to Apache in 2019, Hudi became the first widely adopted table format to support row-level upserts and deletes directly on data lakes.

Metadata Structure
Hudi organizes tables around a commit timeline stored in a .hoodie directory:
- The timeline records every action on the table (commits, delta commits, compactions, cleans, rollbacks) as timestamped instants
- Copy-on-Write (COW) tables rewrite Parquet base files on every update, keeping reads simple and fast
- Merge-on-Read (MOR) tables append updates to row-based log files and merge them with base files at read time or during compaction

This dual-mode design gives engineers control over the trade-off between write latency and read latency.
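
As an illustrative sketch (not Uber’s production setup), the PySpark write below upserts a DataFrame of changes into a merge-on-read Hudi table; the path, field names, and `updates_df` are assumptions.

```python
# Assumes a DataFrame `updates_df` of changed rows; path and field names are illustrative.
hudi_options = {
    "hoodie.table.name": "rides",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "ride_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "city",
}

(updates_df.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")           # "append" triggers upsert semantics on an existing table
    .save("s3://my-bucket/tables/rides"))
```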

Core Features
- Record-key-based upserts and deletes
- Two table types, COW and MOR, to tune the write/read latency trade-off
- Incremental queries that return only records changed since a given commit
- Built-in table services: compaction, clustering, cleaning, and file sizing
- Ingestion utilities (DeltaStreamer / Hudi Streamer) for pulling from Kafka and databases

Row-Level Changes
Hudi was designed for this problem. In COW mode, updates rewrite files. In MOR mode, updates are appended as log blocks, making them queryable almost immediately. Readers can choose:
- Snapshot queries, which merge base and log files for the freshest view
- Read-optimized queries, which read only compacted base files for faster scans
- Incremental queries, which return just the records changed since a given commit

Deletes are handled similarly, either as soft deletes in logs or hard deletes during compaction.
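
A sketch of those three read paths with PySpark, assuming an active SparkSession `spark` with the Hudi bundle on the classpath; the path and commit timestamp are illustrative.

```python
base_path = "s3://my-bucket/tables/rides"  # illustrative table location

# Snapshot query (default): merges base files and log files for the freshest view.
snapshot_df = spark.read.format("hudi").load(base_path)

# Read-optimized query: compacted base files only, trading freshness for scan speed.
ro_df = (spark.read.format("hudi")
         .option("hoodie.datasource.query.type", "read_optimized")
         .load(base_path))

# Incremental query: only records changed since a given commit time.
inc_df = (spark.read.format("hudi")
          .option("hoodie.datasource.query.type", "incremental")
          .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
          .load(base_path))
```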

Ecosystem & Adoption
Hudi integrates tightly with:
- Apache Spark and Apache Flink for ingestion and processing
- Hive, Presto, and Trino for querying
- AWS services such as EMR, Glue, and Athena, where Hudi support is built in

While Iceberg and Delta now dominate conversations, Hudi remains a strong choice for near real-time ingestion and CDC use cases, particularly in AWS-centric stacks. Its flexibility (COW vs MOR) and incremental consumption features make it especially valuable for pipelines that need fast data freshness without sacrificing reliability.

Next, we’ll examine Apache Paimon, the streaming-first format that extends Hudi’s incremental vision with an LSM-tree architecture.

Apache Paimon: Streaming-First by Design

Background & Origins
Apache Paimon began life as Flink Table Store at Alibaba in 2022, targeting the need for continuous, real-time data ingestion directly into data lakes. It entered the Apache Incubator in 2023 under the name Paimon. Unlike Iceberg or Delta, which started with batch analytics and later added streaming features, Paimon was streaming-first. Its mission: make data lakes act like a materialized view that is always up to date.

Metadata & Architecture
Paimon uses a Log-Structured Merge-tree (LSM) design inspired by database internals:
- Writes are appended as sorted run files organized into levels, so ingesting changes is cheap
- A primary key defines how rows with the same key are merged (for example, keeping the latest version)
- Background compaction merges lower levels into larger sorted files to keep reads efficient
- Snapshots and manifests track which files make up the table at any point in time

This architecture makes frequent row-level changes cheap (append-only writes) while deferring heavy merges to compaction tasks.

Core Features
- Primary-key tables with streaming upserts and deletes
- Append-only tables for logs and event streams
- Changelog generation, so downstream jobs can consume a table as a stream of changes
- Unified batch and streaming reads and writes, especially with Flink SQL
- Time travel over snapshots, with background compaction to control file counts

Row-Level Changes
Unlike Iceberg (COW with delete files) or Delta (COW with deletion vectors), Paimon is natively merge-on-read. Updates and deletes are appended as small log segments, queryable immediately. Background compaction gradually merges them into optimized columnar files. This makes Paimon highly efficient for high-velocity workloads like IoT streams, CDC pipelines, or real-time leaderboards.
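
A rough sketch of this pattern with PyFlink SQL, assuming the Paimon Flink connector is on the classpath; the catalog name, warehouse path, and schema are illustrative.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a Paimon catalog; the warehouse location is an assumption.
t_env.execute_sql("""
    CREATE CATALOG paimon_cat WITH (
        'type' = 'paimon',
        'warehouse' = 's3://my-bucket/paimon'
    )
""")
t_env.execute_sql("USE CATALOG paimon_cat")

# A primary-key table: rows with the same key are merged on read, and background
# compaction folds the LSM levels into larger sorted files over time.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS device_state (
        device_id STRING,
        temperature DOUBLE,
        updated_at TIMESTAMP(3),
        PRIMARY KEY (device_id) NOT ENFORCED
    )
""")

# Streaming upsert: inserting an existing key becomes an update of that row.
t_env.execute_sql(
    "INSERT INTO device_state VALUES ('sensor-1', 21.5, CURRENT_TIMESTAMP)"
).wait()
```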

Ecosystem & Adoption
Paimon integrates tightly with Apache Flink, where it feels like a natural extension of Flink SQL. It also has growing support for Spark, Hive, Trino/Presto, and OLAP systems like StarRocks and Doris. Adoption is strongest among teams building streaming lakehouses, particularly those already invested in Flink. While younger than Iceberg or Delta, Paimon is rapidly attracting attention as organizations push for sub-minute data freshness.

Next, we’ll turn to DuckLake, the newest entrant that rethinks table metadata management by moving it entirely into SQL databases.

DuckLake: Metadata Reimagined with SQL

Background & Origins
DuckLake is the newest table format, introduced in 2025 by the DuckDB and MotherDuck teams. Unlike earlier formats that manage metadata with JSON logs or Avro manifests, DuckLake flips the script: it stores all table metadata in a relational SQL database. This approach is inspired by how cloud warehouses like Snowflake and BigQuery already manage metadata internally, but DuckLake makes it open and interoperable.

Metadata & Architecture
- All table metadata (schemas, snapshots, file lists, statistics) lives as rows in a SQL database, such as DuckDB, SQLite, PostgreSQL, or MySQL
- Data files remain standard Parquet on local disk or object storage
- A commit is simply an ACID transaction against the catalog database, rather than a new file written to object storage

This design dramatically reduces the complexity of planning queries (no manifest scanning), makes commits faster, and enables features like cross-table consistency (possible in Apache Iceberg if using the Nessie catalog).
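
A minimal sketch with DuckDB’s `ducklake` extension; the catalog file name and DATA_PATH are illustrative, and the ATTACH string follows the extension’s documented form.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")

# Metadata lives in a SQL database (here a local DuckDB catalog file); data files
# land as Parquet under DATA_PATH, which could just as well be object storage.
con.execute("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_data/')")
con.execute("USE lake")

con.execute("CREATE TABLE orders (order_id INTEGER, amount DOUBLE)")
con.execute("INSERT INTO orders VALUES (1, 19.99), (2, 5.25)")
print(con.sql("SELECT * FROM orders").fetchall())
```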

Core Features
- ACID transactions that can span multiple tables in one commit
- Snapshots, time travel, and schema evolution tracked in the catalog database
- Data stored as plain Parquet files, readable by any Parquet-capable tool
- Metadata that can be inspected and managed with ordinary SQL

Row-Level Changes
DuckLake handles updates and deletes via copy-on-write on Parquet files, but the metadata transaction is nearly instantaneous. Row-level changes are coordinated by the SQL catalog, avoiding the latency and eventual consistency pitfalls of cloud storage–based logs. In effect, DuckLake behaves like Iceberg for data files but with much faster commit cycles.
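
Continuing the sketch above, row-level changes are ordinary SQL statements: DuckLake rewrites the affected Parquet files (copy-on-write), while the commit itself is a small transaction in the catalog database.

```python
# Reuses the `con` connection and `orders` table from the previous sketch.
con.execute("UPDATE orders SET amount = 21.00 WHERE order_id = 1")
con.execute("DELETE FROM orders WHERE order_id = 2")
print(con.sql("SELECT * FROM orders").fetchall())
```

Each statement becomes its own snapshot in the catalog, so earlier versions of the table remain available for time travel.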

Ecosystem & Adoption
- Shipped as a DuckDB extension, with backing from the DuckDB and MotherDuck teams
- The catalog can live in an embedded database for a laptop-scale setup or in PostgreSQL/MySQL for multi-user deployments
- Support outside DuckDB is still early, so most adopters today are DuckDB-centric teams

As of 2025, DuckLake is still young but has sparked excitement by simplifying lakehouse architecture. It’s best seen as a complement to more mature formats, with particular appeal to DuckDB users and teams tired of managing complex metadata stacks.

Next, we’ll step back and compare all five formats side by side, looking at metadata design, row-level update strategies, ecosystem support, and adoption trends.

Comparing the Open Table Formats

Now that we’ve walked through each format individually, let’s compare them across the dimensions that matter most to data engineers and architects.

1. Metadata Architecture
- Iceberg: a hierarchy of metadata files, manifest lists, and manifests; no directory listings required
- Delta Lake: a JSON transaction log with periodic Parquet checkpoints
- Hudi: a commit timeline stored in the .hoodie directory
- Paimon: LSM-style sorted runs tracked through snapshots and manifests
- DuckLake: all metadata in a relational SQL catalog database

2. Row-Level Changes
- Iceberg: copy-on-write by default, with delete files (and newer delete vectors) for merge-on-read
- Delta Lake: copy-on-write, with deletion vectors to avoid full file rewrites
- Hudi: a choice of copy-on-write or merge-on-read tables
- Paimon: natively merge-on-read via LSM appends and background compaction
- DuckLake: copy-on-write data files with near-instant metadata commits in the SQL catalog

3. Ecosystem Support
- Iceberg: the broadest engine and vendor coverage (Spark, Flink, Trino/Presto, Hive, Snowflake, BigQuery, DuckDB, and more)
- Delta Lake: deepest in Spark and Databricks, with connectors for other engines
- Hudi: strong in Spark, Flink, and AWS services (EMR, Glue, Athena)
- Paimon: strongest in Flink, with growing Spark, Trino, and OLAP-engine support
- DuckLake: DuckDB-centric today

4. Adoption Trends
- Iceberg: the de facto industry standard and safe default for new lakehouses
- Delta Lake: widely used in production, concentrated around Databricks shops
- Hudi: a durable niche in CDC and near-real-time ingestion pipelines
- Paimon: fast-growing among streaming-first, Flink-heavy teams
- DuckLake: new and experimental, but influential in how it handles metadata

Next, we’ll step back and examine industry trends shaping the adoption of these formats and what they signal for the future of the lakehouse ecosystem.

Industry Trends Shaping Adoption

The “table format wars” of the past few years are starting to settle into clear patterns of adoption. While no single format dominates every use case, the industry is coalescing around certain choices based on scale, latency, and ecosystem needs.

Iceberg as the Default Standard
Iceberg has emerged as the most widely supported and vendor-neutral choice. Cloud platforms like AWS, Google, and Snowflake have all added native support, and query engines like Trino, Presto, Hive, and Flink integrate with it out of the box. Its Apache governance and cross-engine compatibility make it the safe long-term bet for enterprises standardizing on a single open format.

Delta Lake in the Spark/Databricks World
Delta Lake remains the default in Spark- and Databricks-heavy shops. Its simplicity (transaction logs) and seamless batch/stream integration continue to attract teams already invested in Spark. While its ecosystem is narrower than Iceberg’s, Delta Lake’s deep integration with Databricks runtime and machine learning workflows ensures strong adoption in that ecosystem.

Hudi in CDC and Incremental Ingestion
Hudi carved out a niche in change data capture (CDC) and near real-time ingestion. Telecom, fintech, and e-commerce companies still rely on Hudi for incremental pipelines, especially on AWS where Glue and EMR make it easy to deploy. While Iceberg and Delta have added incremental features, Hudi’s head start and MOR tables keep it relevant for low-latency ingestion scenarios.

Paimon and the Rise of Streaming Lakehouses
As real-time analytics demand grows, Paimon is gaining momentum in the Flink community and among companies building streaming-first pipelines. Its LSM-tree design positions it as the go-to choice for high-velocity data, IoT streams, and CDC-heavy architectures. Although young, its momentum signals a broader shift: the next wave of lakehouse innovation is about sub-minute freshness.

DuckLake and Metadata Simplification
DuckLake reflects a newer trend: rethinking metadata management. By moving metadata into SQL databases, it dramatically simplifies operations and enables cross-table transactions. Adoption is still experimental, but DuckLake has sparked interest among teams who want lakehouse features without managing complex catalogs or metastores. Its trajectory will likely influence how future formats handle metadata.

Convergence and Interoperability
One notable trend: features are converging. Iceberg now supports row-level deletes via delete files; Delta added deletion vectors; Hudi and Paimon both emphasize streaming upserts. Tooling is also evolving toward interoperability: catalog services like Apache Nessie and Polaris aim to support multiple formats, and BI engines increasingly connect to all of them.

In short:
- Iceberg is the neutral, multi-engine default
- Delta Lake dominates Spark and Databricks environments
- Hudi remains the CDC and incremental-ingestion specialist
- Paimon is the streaming-first choice for Flink shops
- DuckLake points toward simpler, SQL-managed metadata

Next, we’ll wrap up with guidance on how to choose the right format based on your workloads, ecosystem, and data engineering priorities.

Choosing the Right Open Table Format

With five strong options on the table (Iceberg, Delta Lake, Hudi, Paimon, and DuckLake), the choice depends less on “which is best” and more on which aligns with your workloads, ecosystem, and priorities. Here’s how to think about it:

When to Choose Apache Iceberg
- You want a vendor-neutral format with the widest engine and warehouse support
- Your workloads are large-scale batch analytics that need safe schema evolution and time travel
- You’re standardizing multiple teams and tools on a single long-term format

When to Choose Delta Lake
- You’re heavily invested in Spark or Databricks
- You want simple, log-based transactions with unified batch and streaming pipelines
- Your machine learning and BI workloads already run on the Databricks platform

When to Choose Apache Hudi
- Your priority is CDC and near-real-time ingestion with frequent upserts and deletes
- You need incremental queries so downstream jobs process only what changed
- You run on AWS, where EMR, Glue, and Athena make Hudi easy to deploy

When to Choose Apache Paimon
- You’re building a streaming-first lakehouse, especially on Flink
- You need sub-minute freshness for high-velocity data such as IoT streams or CDC feeds
- You want merge-on-read behavior by default, with compaction handled in the background

When to Choose DuckLake
- You want lakehouse features without running a separate catalog or metastore service
- Your team is DuckDB-centric or works at laptop-to-single-node scale
- You value fast commits and cross-table transactions backed by a SQL database

Final Takeaway

No matter which you choose, adopting an open table format is the key to turning your data lake into a true lakehouse: reliable, flexible, and future-proof.

Conclusion

Open table formats are no longer niche; they’re the foundation of the modern data stack. Whether your challenge is batch analytics, real-time ingestion, or simplifying metadata, there’s a format designed to meet your needs. The smart path forward isn’t picking one blindly, but aligning your choice with your data velocity, tooling ecosystem, and long-term governance strategy.

In practice, many organizations run more than one format side by side. The good news: as open standards mature, interoperability and ecosystem support are expanding, making it easier to evolve over time without locking yourself into a dead end.

The lakehouse era is here, and open table formats are its backbone.
