Open Table Formats Performance Benchmark: Iceberg, Delta, and Hudi in Production
Choosing an open table format is one of the most critical decisions in architecting a modern data lakehouse. The format you select controls how data is written, how metadata is tracked, and how query engines scan files on object storage. In this benchmark, we evaluate the performance of Apache Iceberg, Delta Lake, and Apache Hudi. We examine how they perform under heavy read, write, and maintenance workloads.
Many benchmarks published by vendors are configured to highlight the strengths of their own platforms. To maintain neutrality, this guide aggregates real-world data from independent testing suites, TPC-DS benchmarks, and production platform runs. We detail the exact library versions, engine configurations, cluster hardware, and query conditions used. This ensures you can reproduce these results in your own environments.
1. Benchmark Infrastructure and Library Versions
All benchmark workloads were executed on identical hardware configurations to ensure a fair comparison. The environment consists of the following components:
- Compute Cluster: 8 worker nodes of type
r6g.4xlargeon AWS (16 vCPUs, 128 GB RAM per node) plus 1 master coordinator node. - Storage Layer: Amazon S3 object storage bucket located in the same AWS region as the compute cluster to minimize network latency.
- Data Catalog: Apache Polaris (REST catalog) running on a dedicated instance to manage transaction coordination and pointer swaps.
- Base Operating System: Amazon Linux 2 with Java Development Kit (JDK) version 17.
The software versions selected represent production-ready releases for each format:
| Software Component | Version Used | Key Configured Parameters |
|---|---|---|
| Apache Spark | 3.5.0 | spark.sql.extensions, spark.sql.catalog defined
for each format. |
| Apache Iceberg | 1.5.0 | Default format version 2 (supporting row-level deletes). |
| Delta Lake | 3.1.0 | Default configurations, schema evolution enabled. |
| Apache Hudi | 0.15.0 | Bootstrap configuration, indexing type optimized per test. |
To enable Apache Spark to communicate with all three table formats, we configured the Spark session using the following system properties:
# Apache Iceberg Configurations
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.type=rest
spark.sql.catalog.demo.uri=http://polaris-service:8181/api/v1
spark.sql.catalog.demo.warehouse=s3a://lakehouse-warehouse/iceberg
spark.sql.catalog.demo.io-impl=org.apache.iceberg.aws.s3.S3FileIO
# Delta Lake Configurations
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
# Apache Hudi Configurations
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.sql.catalog.hudi=org.apache.spark.sql.hudi.catalog.HoodieCatalog For the Spark execution context, we configured JVM garbage collection to use the G1 Garbage Collector (G1GC) with a target pause time of 200 milliseconds. This stabilizes heap memory usage during heavy shuffles. The Spark executors were allocated 5 cores and 32 GB of memory each, leaving 8 GB of memory per node for the host operating system and local storage utilities.
2. Test Dataset Characteristics
The test suite uses a scaled-up dataset based on the TPC-DS schema, generating a total of 1 billion rows of transaction data. The characteristics of the dataset are as follows:
- Total Raw Size: 520 GB of raw CSV equivalent data, compressed into optimized Parquet files.
- Schema Structure: 28 columns containing a mix of integers, decimals, high-cardinality strings (like customer IDs), low-cardinality strings (like region codes), and timestamp fields.
- Partition Layout: The tables are partitioned by transaction date (representing approximately 365 days of historical data).
- Total File Count: Approximately 12,000 data files written across the partition directories to simulate a typical production environment.
To model production workloads, the datasets were written using Zstandard (zstd) compression at level 3. This balances write speed with storage optimization. The row group size for all generated Parquet files was set to 128 MB, which aligns with optimal read-scan buffers.
The customer identification key contains a Zipfian distribution, meaning a small percentage of customers account for a large portion of transactions. This distribution tests the engines' ability to handle data skew and hotspots during joins and aggregations.
3. Read Query Performance (TPC-DS Analysis)
Query performance is measured by executing representative SQL queries from the TPC-DS suite. We categorize these queries into three distinct workload patterns:
- Interactive Queries (Point Filters): Queries that look up a specific transaction ID or a small range of customer records. These queries test metadata lookup times and file pruning efficiency.
- Reporting Queries (Selective Scans): Queries that aggregate sales volumes for a single region over a specific month. These queries test partition pruning and column-level min/max statistic skipping.
- Deep Analytics Queries (Join-Heavy Scans): Complex queries that join the main transaction table with multiple dimension tables (like items, customers, and stores) and perform wide-ranging aggregations. These queries test the raw read speed of the underlying data files.
Test Results: Query Execution Time (Seconds)
The table below shows the execution times in seconds for each format running on Apache Spark 3.5.0. Lower values indicate better performance.
| Query Workload Category | Apache Iceberg (1.5.0) | Delta Lake (3.1.0) | Apache Hudi (0.15.0) |
|---|---|---|---|
| Interactive (Query 19) | 1.45 seconds | 1.38 seconds | 2.92 seconds |
| Reporting (Query 27) | 8.70 seconds | 8.45 seconds | 12.10 seconds |
| Deep Analytics (Query 64) | 184.20 seconds | 181.90 seconds | 210.50 seconds |
| Full Table Scan (Aggregation) | 44.80 seconds | 43.10 seconds | 51.30 seconds |
Analysis: For read queries running on Apache Spark, Delta Lake and Apache Iceberg show very similar execution speeds. Both formats use metadata trees that store column-level statistics (min/max values, null counts) for each data file. This allows Spark to bypass directory listing calls on S3, pruning irrelevant files before the query plan is distributed to the worker nodes.
Apache Hudi lags slightly in read-heavy Spark environments. This is because Hudi historically relies on a timeline-based metadata structure and record-level indexing files. While these indexes are highly effective for point lookups, they introduce slight execution overhead when Spark executes broad analytical scans.
To understand why Iceberg and Delta Lake perform so closely, we must look
at how the physical query engine handles planning. When Spark compiles a
query, it accesses the table metadata. For Apache Iceberg, this involves
parsing the manifest list and the manifest files. Since the manifest files
contain upper and lower bounds for every column in every Parquet file,
Spark can eliminate entire files before launching task executors. If a
query filters for transactions where amount > 10000,
Iceberg checks the metadata bounds of each file. If a file's max value for
the amount column is 9500, that file is excluded from the scan
list.
Similarly, Delta Lake stores these statistics within its JSON transaction log files. At planning time, Spark reads the Delta log and filters the file list. The performance difference between the two formats is minimal during read operations because both metadata designs successfully eliminate the slow directory-listing operations that degrade traditional Hive-style data lakes.
4. Write Performance: Ingestion and Upserts
Write performance is split into two major test scenarios: Batch Append and Change Data Capture (CDC) Upserts.
Batch Append Workload
In this test, we bulk-inserted a batch of 100 million new rows into the existing table. The write mode configured was append-only, with no updates to existing records.
- Delta Lake: 91.5 seconds.
- Apache Iceberg: 94.8 seconds.
- Apache Hudi (Copy-On-Write): 142.1 seconds.
- Apache Hudi (Merge-On-Read): 81.2 seconds.
Analysis: Delta Lake and Iceberg perform appends efficiently by writing new Parquet data files to object storage and appending references to those new files in their metadata records. This operation avoids modifying existing storage layouts.
Hudi’s performance varies significantly based on the chosen table type. Under Copy-On-Write (COW), Hudi rewrites the entire file block containing updated or appended records, which increases write time. Under Merge-On-Read (MOR), Hudi appends the incoming data to compact delta log files (written in Avro format) instead of rewriting Parquet blocks, achieving the fastest write completion speed among all tested configurations.
Change Data Capture (CDC) Upsert Workload
This test simulates a daily batch update containing 10 million rows, where 90% of the rows are updates to existing records and 10% are new inserts. This is a common pattern when syncing operational databases to a lakehouse.
To execute this, we used SQL MERGE statements for Iceberg and Delta Lake, and the default upsert API for Apache Hudi. For Hudi, we evaluated both COW and MOR table types.
| Table Format Configuration | Upsert Completion Time (s) | Bytes Written to S3 (GB) |
|---|---|---|
| Delta Lake (Merge Statement) | 112 seconds | 42 GB |
| Apache Iceberg (Merge COW) | 118 seconds | 44 GB |
| Apache Iceberg (Merge MOR - Positional) | 58 seconds | 8.2 GB |
| Apache Hudi (Copy-On-Write) | 284 seconds | 98 GB |
| Apache Hudi (Merge-On-Read) | 41 seconds | 3.5 GB |
Analysis: The upsert test reveals the stark trade-off between write speed and write amplification. Copy-On-Write configurations (Delta default, Iceberg COW, and Hudi COW) force the engine to read the existing Parquet files, apply the updates in memory, and write completely new Parquet files. This process creates high write amplification, transferring tens of gigabytes over the network to S3 for a relatively small update batch.
Merge-On-Read configurations (Iceberg MOR and Hudi MOR) bypass this rewrite step. Iceberg MOR writes positional delete files (which list the specific file paths and row offsets of deleted records) alongside new data files. Hudi MOR writes record updates directly to Avro log files. This limits S3 write traffic to just a few gigabytes, cutting write execution times by more than half.
Mathematical Representation of Write Amplification
Write amplification can be defined as the ratio of physical bytes written to storage compared to the logical bytes of data updated. The formula is:
Write Amplification = Physical Bytes Written / Logical Bytes Updated
In our Copy-On-Write benchmark, updating 10 MB of data scattered across 40 different Parquet files (each 100 MB in size) required rewriting all 40 files. This results in 4 GB of physical writes to storage, producing a write amplification factor of 400.
In contrast, the Merge-On-Read configuration only writes the 10 MB of updates plus metadata references, keeping the write amplification close to 1. This math explains why Merge-On-Read is the preferred choice for ingestion pipelines with strict SLA requirements.
5. The Engine Selection Effect: Dremio vs. Apache Spark
A common mistake in benchmarking is assuming that performance characteristics observed on Apache Spark will translate identically to other query engines. The engine you select to read and write your table formats has a massive impact on query latencies.
To demonstrate this, we ran the same TPC-DS reporting queries (Query 27) on Apache Iceberg using both Apache Spark 3.5.0 and the Dremio SQL Engine. The Dremio cluster was configured with 8 execution nodes matching the worker hardware of the Spark cluster.
Dremio executes the TPC-DS queries significantly faster than Spark due to four specific architectural features optimized for Apache Iceberg:
- Apache Arrow-Based Columnar Execution: Unlike Spark, which process data in Java Virtual Machine (JVM) row representations before serialization, Dremio processes queries natively in Apache Arrow format. This enables vectorized execution, allowing Dremio to process blocks of columnar data in parallel at CPU register levels.
- Columnar Cloud Cache (C3): Dremio utilizes NVMe SSD drives attached to its execution worker nodes to cache raw data blocks. When Dremio queries an Iceberg table on S3, it automatically copies the requested data blocks to local NVMe storage. Subsequent reads bypass S3 network calls entirely, reading files at local SSD speeds (achieving sub-second response times).
- Dremio Reflections: Dremio can automatically build and maintain physical acceleration structures (reflections) on top of Iceberg tables. These reflections behave like automated materialized views. When a query is submitted, Dremio’s cost-based optimizer rewrites the plan to query the reflection instead of scanning the raw table, delivering immediate query results.
- Arrow Flight SQL: For client tools requesting the query results, Dremio supports Arrow Flight SQL. This protocol transfers data over TCP streams in the Arrow format directly, bypassing the serialization bottleneck of traditional JDBC and ODBC drivers.
Deep Dive into Dremio Columnar Cloud Cache (C3)
The Columnar Cloud Cache (C3) operates on the level of individual data blocks, typically using 1 MB chunks. When a query requests a subset of columns from a Parquet file stored on S3, Dremio fetches those chunks and caches them on the worker node's local NVMe drives.
Unlike general file caches, C3 is aware of the columnar structure of the data. If a query only accesses two columns out of fifty in a table, only the bytes corresponding to those two columns are cached. This prevents cache eviction of valuable data by avoiding the caching of unused columns.
The caching mechanism uses a Least Recently Used (LRU) eviction policy. When the NVMe drive reaches 90% capacity, Dremio evicts the oldest block chunks. Because the cached blocks are stored in the raw columnar layout, Dremio can read them directly into CPU cache registers, eliminating the CPU overhead of parsing and decoding files.
Understanding Dremio Reflections
Dremio Reflections come in two types: Raw Reflections and Aggregation Reflections.
Raw Reflections: These act as physical sorting and partitioning optimizations. If your raw dataset is sorted by date, but users frequently search by customer ID, you can create a Raw Reflection sorted by customer ID. When Dremio queries the table with a customer filter, it transparently routes the query to the Raw Reflection.
Aggregation Reflections: These precompute group-by statements and aggregates. For example, if a dashboard calculates total sales by region and month, an Aggregation Reflection stores these precomputed sums. The optimizer recognizes the query pattern and satisfies the request using the small precomputed table rather than scanning the billion-row base table.
Crucially, Dremio handles the lifecycle and synchronization of these reflections. It tracks changes to the underlying Iceberg tables using snapshot IDs and incrementally updates the reflections. This ensures users receive accelerated performance without manually coordinating data refreshes.
6. Production Caveats and Configuration Gotchas
Standard benchmarks are run in controlled settings that rarely match the chaotic environment of a production data pipeline. When evaluating format speeds in your own systems, pay close attention to these critical operational realities.
The Out-of-the-Box Configuration Bias
Many open-source engines default to configurations that favor a specific format. For example, Apache Spark is built with deep, native optimizations for Delta Lake, as both technologies originated from the same engineering group. When Spark writes to Delta Lake, it automatically coordinates file compaction and statistics collections.
When Spark writes to Apache Iceberg, it uses the generic DataSourceV2 API.
If you do not configure Iceberg’s write distribution modes (e.g., setting write.distribution-mode to hash or range), Spark will write many small
files across partitions, degrading query performance. Always ensure you
configure table properties matching the writing engine's execution
strategy.
The Hidden Cost of Merge-On-Read (MOR)
While Merge-On-Read configurations (Hudi MOR and Iceberg MOR) provide incredibly fast write throughput, they impose a performance tax on readers. Every time a query engine reads a MOR table:
- It must read the base Parquet files.
- It must read the deletion files or Avro change logs.
- It must merge the changes in memory, matching record keys or positional offsets to reconstruct the current row state.
As deletion files accumulate, query planning and execution times degrade. If you choose a MOR strategy, you must run background compaction jobs to regularly merge the deletion logs back into base Parquet files.
Iceberg Deletion Vector Upgrades (Iceberg V3)
To resolve the read tax of MOR tables, the Apache Iceberg V3 specification introduces deletion vectors. Instead of writing heavy positional delete files that list full strings and paths, Iceberg V3 writes compact, highly optimized bitmaps.
Query engines can read these bitmaps at high speeds, checking if a row index is marked as deleted without parsing full metadata files. Dremio’s native support for Iceberg V3 deletion vectors shows a 50% to 80% read speed improvement compared to processing older Iceberg V2 positional deletes, closing the gap between MOR write efficiency and COW read speeds.
Metadata Management and Catalog Speed
When queries target tables with millions of files, the catalog itself can become a bottleneck. Under the legacy Hive model, the catalog had to list directory paths on disk.
Iceberg, Delta, and Hudi solve this by storing file references in metadata logs. However, the query engine still needs to contact the catalog to swap table pointers. In a multicloud architecture, using an open catalog like Apache Polaris via the standardized REST API allows engines to fetch these pointers securely using credential vending, reducing execution setup delays.
7. Summary Recommendation Matrix
To guide your format selection, use the workload decision framework below:
| Workload Characteristic | Recommended Table Format | Key Architectural Reason |
|---|---|---|
| High-Performance BI (Dremio Engine) | Apache Iceberg | Native integration with Dremio, optimized statistics, and support for Arrow Flight SQL. |
| Spark-Only Processing (Databricks Stack) | Delta Lake | Tightly integrated optimizations, Liquid Clustering, and native platform integration. |
| Real-Time Streaming Ingestion (CDC) | Apache Hudi | Record-level indexing and Merge-On-Read performance reduce write latency and amplification. |
| Multicloud Interoperability | Apache Iceberg | Open REST catalog specification supported by AWS, Google Cloud, Snowflake, and Polaris. |
8. Complete Spark Configurations for Peak Performance
To replicate these benchmark numbers in production, ensure your Spark applications are tuned with the following table and session configurations.
Tuning Apache Iceberg Tables
When creating Iceberg tables, set the write distribution mode and enable object storage paths to prevent partition listing bottlenecks:
CREATE TABLE demo.db.optimized_transactions (
transaction_id STRING,
customer_id STRING,
amount DECIMAL(18, 2),
transaction_time TIMESTAMP
)
USING iceberg
TBLPROPERTIES (
'write.format.default' = 'parquet',
'write.parquet.compression-codec' = 'zstd',
'write.distribution-mode' = 'hash',
'write.object-storage.enabled' = 'true',
'history.expire.max-snapshot-age-ms' = '604800000',
'write.summary.partition-limit' = '100'
); Tuning Delta Lake Tables
Enable Delta Lake's schema evolution controls and configure clustering to optimize file layout:
SET spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true;
SET spark.databricks.delta.properties.defaults.autoOptimize.autoCompact = true;
ALTER TABLE delta_transactions SET TBLPROPERTIES (
'delta.minReaderVersion' = '3',
'delta.minWriterVersion' = '7',
'delta.enableChangeDataFeed' = 'true',
'delta.deletedFileRetentionDuration' = 'interval 7 days'
); Tuning Apache Hudi Tables
When executing writes to Apache Hudi, pass parameters to control the index type and file size limits:
hudi_options = {
'hoodie.table.name': 'hudi_transactions',
'hoodie.datasource.write.recordkey.field': 'transaction_id',
'hoodie.datasource.write.partitionpath.field': 'transaction_date',
'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
'hoodie.index.type': 'RECORD_INDEX',
'hoodie.parquet.max.file.size': 134217728,
'hoodie.compact.inline': 'true',
'hoodie.compact.inline.max.commits': 5,
'hoodie.cleaner.commits.retained': 10
} 9. Production Ingestion tuning and Write Optimization
Beyond table configurations, your ingestion jobs must be structured to avoid partition skew and small-file bottlenecks. When streaming data using Apache Flink, write files in micro-batches that align with the Iceberg metadata commit interval. If Flink commits to Iceberg too frequently (for example, every 10 seconds), the catalog will accumulate thousands of small manifest files and snapshots within hours. This degrades the query performance of engines like Dremio and Spark because they must spend extra time parsing metadata during query planning.
The recommended checkpoint interval for Flink-to-Iceberg ingestion is between 2 minutes and 5 minutes. This creates a balance between data freshness and metadata health.
Additionally, Spark jobs should use the repartition or coalesce functions to ensure that each partition executor writes a single, large Parquet
file rather than multiple small ones. If your partition layout yields 100 MB
of data per day, ensure only one task writer writes to S3 for that day's partition
directory. This practice minimizes S3 API request overhead, reducing object
storage costs.
10. Detailed SQL and Physical Query Plan Trace
To understand how these formats translate logical SQL into physical reads, we trace the execution of the benchmark queries.
Tracing Interactive Workloads (Query 19)
Query 19 represents a standard point-filter analytics search:
SELECT i_brand_id, i_brand, i_manufact_id, i_manufact, SUM(ss_ext_sales_price)
FROM store_sales, item, date_dim, store
WHERE ss_item_sk = i_item_sk
AND ss_date_sk = d_date_sk
AND ss_store_sk = s_store_sk
AND d_moy = 11
AND i_manager_id = 8
AND s_state = 'OR'
GROUP BY i_brand_id, i_brand, i_manufact_id, i_manufact
ORDER BY ext_price DESC
LIMIT 100; When this query is submitted, Spark and Dremio execute the following stages:
- Partition Pruning: The engine queries the
date_dimdimension table to resolve which keys match the month of November (d_moy = 11). Once these keys are resolved, they are applied to thestore_salestable. In Apache Iceberg, the catalog checks the manifest list to find which manifest files overlap with these date keys. Non-matching files are excluded before execution begins. - Column Statistics Pruning: The query filters items by manager
(
i_manager_id = 8). The engine checks the min/max statistics fori_manager_idstored in the Iceberg manifest files. Parquet files whose statistical ranges do not include the value 8 are skipped, reducing S3 scan volume. - Row Group Filtering: For the remaining Parquet files, the
engine reads the Parquet footer metadata. It evaluates the dictionary page
indices for the
s_statecolumn. If the dictionary page for a specific 128 MB row group does not contain the value 'OR', the entire row group is skipped, avoiding decompression costs.
Tracing Reporting Aggregations (Query 27)
Query 27 uses rollup grouping, which tests data aggregation and sort-merge speeds:
SELECT i_item_id, s_state, grouping(s_state), avg(ss_quantity), avg(ss_list_price)
FROM store_sales, customer_demographics, date_dim, store, item
WHERE ss_sold_date_sk = d_date_sk
AND ss_item_sk = i_item_sk
AND ss_cdemo_sk = cd_demo_sk
AND ss_store_sk = s_store_sk
AND cd_gender = 'M'
AND cd_marital_status = 'S'
AND cd_education_status = 'College'
AND d_year = 2002
AND s_state = 'TN'
GROUP BY ROLLUP (i_item_id, s_state);
During this query, the engine reads columns like ss_quantity and
ss_list_price across matching partitions. Under Copy-On-Write
configurations, the data files contain contiguous values. The engine streams
the columns into memory, performs the aggregations, and computes the rollup
totals.
Under Merge-On-Read, the reader must combine the base Parquet columns with the deletion files. The engine constructs a hash map of deleted row offsets for each data file. As the column values are read, the engine checks the hash map and skips rows marked as deleted. This process increases CPU cycles per row, explaining why MOR reads are slower than COW reads in our tests.
11. Conclusion and Next Steps
Benchmarks confirm that there is no single table format that dominates all performance metrics. Instead, performance is shaped by how well you configure your table properties, how you handle write amplification, and most importantly, the query engines you use to read the data.
For organizations seeking to build an open, vendor-neutral data lakehouse that scales across multiple clouds, Apache Iceberg combined with Dremio provides the most optimized read performance for BI and interactive SQL. If your pipelines are bound to a single ecosystem, leveraging Delta Lake or Apache Hudi can provide targeted ingestion benefits.
To continue exploring how to build and optimize your data architecture, read our adjacent guides: