Apache Iceberg Explained
Apache Iceberg is an open table format for large analytic tables stored in object storage. It was created at Netflix in 2017, open-sourced in 2018, and graduated to a top-level Apache Software Foundation project in 2020. Today it is the most widely adopted open table format across cloud providers, query engines, and data platforms.
At its core, Iceberg solves a problem that plagued Hive-style data lakes: there was no reliable way to know exactly which files belonged to a table at any given moment, which made concurrent writes dangerous, schema changes painful, and consistent reads nearly impossible at scale. Iceberg replaces that folder-based approach with a proper metadata system that tracks every file, every schema change, and every transaction.
The Problem Iceberg Solves
In the Hive model, a table is a directory. Writers add files to that directory, and readers scan whatever is there. This causes three practical problems:
- No atomicity. A reader that starts a scan while a writer is adding files may see a partially complete write, returning inconsistent data.
- Partition discovery is slow. Finding all the partitions in a large table requires listing every subdirectory in object storage, which is expensive and slow.
- Schema changes break things. Renaming or reordering columns in a Hive table often requires rewriting all the data files.
Iceberg addresses all three with a metadata-first design. The table is not a directory. It is a pointer to a metadata file that describes the complete table state.
The Core Abstraction: Snapshots and Manifests
Iceberg tracks table state through a hierarchy of metadata files. Each layer summarizes the one below it, which lets query engines do most of their work at the metadata level before touching any actual data.
- Catalog (table name → metadata pointer)
  - Table Metadata JSON (schema, partition spec, snapshot list)
    - Manifest List (Avro; one entry per manifest file, with partition summaries)
      - Manifest Files (Avro; one entry per data file, with column stats)
        - Data Files (Parquet; the actual rows)
When a query arrives, the engine reads the catalog to find the metadata file, reads the metadata to find the current snapshot's manifest list, reads each manifest to find which data files match the query filters (using the per-file min/max statistics), and only then reads the actual Parquet data. A query that filters on a date column may skip 99% of the data files without opening them.
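The metadata-level pruning described above can be sketched in a few lines. This is an illustrative simulation, not the real Iceberg planner: the file paths and date statistics are invented, but the logic mirrors how an engine uses per-file min/max stats from a manifest to skip files.

```python
# Simulated manifest entries: each data file carries min/max stats for a
# column (ISO date strings compare correctly as plain strings).
from dataclasses import dataclass

@dataclass
class DataFileEntry:
    path: str
    min_order_date: str
    max_order_date: str

manifest = [
    DataFileEntry("s3://bucket/t/data/f1.parquet", "2024-01-01", "2024-01-31"),
    DataFileEntry("s3://bucket/t/data/f2.parquet", "2024-02-01", "2024-02-29"),
    DataFileEntry("s3://bucket/t/data/f3.parquet", "2024-03-01", "2024-03-31"),
]

def files_for_filter(entries, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range."""
    return [e.path for e in entries
            if e.min_order_date <= hi and e.max_order_date >= lo]

# A filter on February touches one file; the other two are never opened.
print(files_for_filter(manifest, "2024-02-10", "2024-02-20"))
```

The same overlap test runs at the manifest-list level first (against partition summaries), so whole manifests can be skipped before any per-file check happens.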
For a deeper walkthrough of this structure, see the Apache Iceberg Architecture guide.
Key Capabilities
ACID Transactions
Iceberg uses optimistic concurrency control. A writer reads the current table state, makes changes, and atomically swaps the metadata pointer to a new snapshot. Readers always see a complete, consistent snapshot. Concurrent writers retry if a conflict occurs.
Schema Evolution Without Data Rewrites
Iceberg assigns every column a numeric ID rather than relying on column names. This means you can rename a column, add a column, or reorder columns without touching a single data file. The old files still work correctly because Iceberg maps the column IDs to the new names at read time.
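A minimal sketch of ID-based resolution, simplified from how Iceberg actually works (real Iceberg stores field IDs in both the table schema and the file footers; the dictionaries below are stand-ins). An old data file was written when field 2 was named "amount"; the table schema has since renamed it to "total", yet reads still resolve correctly because both sides are keyed by the numeric field ID:

```python
# field-id -> column values, as written in an old data file
file_columns = {1: [100, 200], 2: [9.99, 24.5]}

old_schema = {1: "order_id", 2: "amount"}
new_schema = {1: "order_id", 2: "total"}   # rename: no file rewrite needed

def read(file_cols, schema):
    """Resolve file columns to names via field IDs, never via old names."""
    return {schema[fid]: vals for fid, vals in file_cols.items()}

print(read(file_columns, new_schema))  # {'order_id': [100, 200], 'total': [9.99, 24.5]}
```

Name-based formats break here: a rename would orphan the old column in every existing file, which is why Hive-era renames forced rewrites.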
Hidden Partitioning
In Hive, users have to know the partition columns and include them in every query to get partition pruning. Iceberg handles this automatically. You declare a partition transform (days(order_date), for example), and Iceberg applies it at write time and uses it for pruning at read time, without ever exposing partition columns to query authors.
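The days() transform can be sketched as a simple reimplementation (the real transform is defined in the Iceberg spec; this version covers only date inputs). It maps a value to a day ordinal, so rows from the same day always land in the same partition and an equality filter on the source column prunes to exactly one partition:

```python
from datetime import date

EPOCH = date(1970, 1, 1)

def days_transform(d: date) -> int:
    """days(col): number of whole days since the Unix epoch."""
    return (d - EPOCH).days

# Rows from the same calendar day share a partition value...
assert days_transform(date(2024, 3, 15)) == days_transform(date(2024, 3, 15))
# ...so a filter like order_date = DATE '2024-03-15' prunes to one partition,
# even though the query never mentions a partition column.
print(days_transform(date(2024, 3, 15)))
```

Because the transform is recorded in table metadata, the engine can rewrite filters on order_date into filters on the derived partition value on the user's behalf.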
Time Travel and Rollback
Every commit creates an immutable snapshot. You can query any past snapshot by timestamp or snapshot ID, compare two snapshots to see what changed, and roll back a table to any previous state with a single metadata operation (no data is rewritten on rollback).
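Resolving a timestamp to a snapshot is a simple lookup over the snapshot log kept in table metadata. A sketch with invented snapshot IDs and timestamps:

```python
# Simplified snapshot log, as recorded in table metadata (invented values).
snapshots = [
    {"snapshot_id": 101, "timestamp_ms": 1_700_000_000_000},
    {"snapshot_id": 102, "timestamp_ms": 1_700_100_000_000},
    {"snapshot_id": 103, "timestamp_ms": 1_700_200_000_000},
]

def snapshot_as_of(log, ts_ms):
    """Latest snapshot committed at or before ts_ms (None if none exists)."""
    eligible = [s for s in log if s["timestamp_ms"] <= ts_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda s: s["timestamp_ms"])["snapshot_id"]

print(snapshot_as_of(snapshots, 1_700_150_000_000))  # 102
```

Rollback reuses the same machinery: it makes an older snapshot current again by repointing metadata, which is why no data files are rewritten.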
Partition Evolution
You can change a table's partition scheme without rewriting existing data files. Old files are queried with the old partition layout, new files use the new layout. Iceberg's query planner handles both transparently.
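A sketch of how the planner copes with two layouts at once. In real Iceberg, each manifest records the partition spec its files were written under; here, two invented files carry a spec_id and the planner prunes each one with the matching spec:

```python
# Invented files: one written under an old monthly spec (spec_id 0),
# one under a newer daily spec (spec_id 1).
files = [
    {"path": "f_old.parquet", "spec_id": 0, "partition": {"month": "2023-06"}},
    {"path": "f_new.parquet", "spec_id": 1, "partition": {"day": "2024-03-15"}},
]

def matches(file, query_day):
    """Prune each file using the spec it was actually written with."""
    if file["spec_id"] == 0:
        return file["partition"]["month"] == query_day[:7]   # monthly layout
    return file["partition"]["day"] == query_day             # daily layout

print([f["path"] for f in files if matches(f, "2024-03-15")])
```

Query authors see none of this: the filter is written against the data column, and per-spec pruning happens entirely inside the planner.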
How a Write Commit Works
A commit proceeds in stages: the writer writes new data files to object storage, writes manifest files listing them, writes a new manifest list and a new table metadata file, and finally asks the catalog to swap the table's metadata pointer from the old file to the new one. That catalog swap is the only atomic step. If two writers try to commit against the same metadata version, only one succeeds; the other retries from the current state. This is what gives Iceberg serializable isolation without requiring a distributed lock.
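The retry loop amounts to a compare-and-swap on the metadata pointer. A minimal single-process sketch (a real catalog performs the swap atomically, e.g. via a conditional database update; the class below just models the contract):

```python
class Catalog:
    """Toy catalog: holds one metadata-pointer version per table."""
    def __init__(self):
        self.version = 0

    def commit(self, expected_version, new_version):
        if self.version != expected_version:
            return False             # another writer committed first
        self.version = new_version   # atomic pointer swap in a real catalog
        return True

def write_with_retry(catalog, max_retries=3):
    for _ in range(max_retries):
        base = catalog.version       # read the current table state
        # ... write new data files, manifests, and metadata based on `base` ...
        if catalog.commit(base, base + 1):
            return True              # pointer swap succeeded
        # conflict: loop re-reads the new state and rebuilds the commit
    return False

cat = Catalog()
assert write_with_retry(cat) and write_with_retry(cat)
print(cat.version)  # 2
```

Note the failure path discards only metadata work: the already-written data files are simply re-listed in the retried commit (or cleaned up later), which keeps conflicts cheap.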
Engine Support
One reason Iceberg has become the default open table format is that virtually every query engine now supports it natively.
| Engine | Read | Write | Notes |
|---|---|---|---|
| Apache Spark | Yes | Yes | Most complete integration; CoW + MoR |
| Apache Flink | Yes | Yes | Streaming sink with exactly-once |
| Trino | Yes | Yes | Full DML support |
| Dremio | Yes | Yes | Native with AI Semantic Layer |
| AWS Athena | Yes | Yes | Native via AWS Glue catalog |
| Google BigQuery | Yes | Yes | BigLake managed Iceberg tables |
| Snowflake | Yes | Yes | Iceberg tables + Open Catalog (Polaris) |
| DuckDB | Yes | Partial | Via iceberg extension |
| PyIceberg | Yes | Yes | Python-native client library |
| StarRocks / Doris | Yes | Yes | Full native support |
The Iceberg Catalog and REST API
Catalogs are how engines find tables. Iceberg defines a standard REST Catalog API that any catalog implementation exposes: create/read/update/delete tables, list namespaces, commit snapshots, and vend storage credentials. Because the API is an open standard, any engine that implements it can talk to any catalog.
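The endpoint layout those operations map to can be sketched as URL construction (paths follow the Iceberg REST Catalog OpenAPI definition as I understand it; the base URL and names are invented, only a single-part namespace is handled, and no HTTP request is made here):

```python
def list_namespaces_url(base, prefix=""):
    """GET: list namespaces known to the catalog."""
    p = f"/{prefix}" if prefix else ""
    return f"{base}/v1{p}/namespaces"

def load_table_url(base, namespace, table, prefix=""):
    """GET: load a table's current metadata by namespace and name."""
    p = f"/{prefix}" if prefix else ""
    return f"{base}/v1{p}/namespaces/{namespace}/tables/{table}"

print(load_table_url("https://catalog.example.com", "sales", "orders"))
```

The optional prefix is returned by the catalog's configuration endpoint at startup, which is how one client library can talk to differently-hosted catalogs without code changes.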
Apache Polaris, co-created by Dremio and Snowflake, is the reference implementation of this REST Catalog standard. Other catalogs that implement the same spec include Project Nessie, AWS Glue, and Snowflake Open Catalog.
For the full API reference, see the Iceberg REST Catalog guide.
Where Iceberg Fits in the Data Stack
- Ingestion (Kafka, CDC, batch ETL) feeds Iceberg Tables (Bronze / Silver / Gold)
- A Catalog (Apache Polaris / Glue / Nessie) exposes those tables to every consumer:
  - Analytics (Dremio, Trino, Athena), which in turn serves BI Tools (Superset, Tableau, Power BI)
  - ML / AI (Spark, PyIceberg, DuckDB)
  - AI Agents (MCP, LangChain, Dremio AI Agent)
When Not to Use Iceberg
Iceberg is designed for large analytical tables. It adds overhead that is not worth it for small reference tables with a few thousand rows (use a regular database). It is also not a good fit when you need sub-millisecond point lookups by primary key (use a transactional database or key-value store). And if all your compute runs inside a single managed warehouse that already handles table management, adding Iceberg may create more complexity than value.
Go Deeper
- Apache Iceberg Architecture — the full metadata tree, commit flow, and concurrency model
- Iceberg vs Delta Lake vs Apache Hudi — when to choose each format
- Iceberg REST Catalog and Apache Polaris — the catalog API that enables multi-engine access
- Snapshots and Time Travel — how Iceberg manages history
- Apache Iceberg Knowledge Base — 115 technical reference pages