Apache Iceberg Reference
A reference for every major Apache Iceberg concept, from the core table format and metadata layer to catalogs, query engines, operational patterns, and agentic data architectures. Each entry is written as a standalone resource and is interlinked with related entries across the knowledge base.
115 terms across 9 categories
Core Concepts
- ACID Transactions in Apache Iceberg Apache Iceberg delivers full ACID transaction guarantees on object storage through optimistic concurrency control and at …
- Apache Iceberg Spec v1 vs v2 Apache Iceberg Spec v2 introduced row-level deletes (delete files), sequence numbers, required field tracking, and impro …
- Apache Iceberg Spec v3 Apache Iceberg Spec v3 introduces deletion vectors for more efficient row-level deletes, the Variant data type for semi- …
- Apache Iceberg Spec v4 (Current State) Apache Iceberg Spec v4 is in early community discussion and proposal stages as of 2025, with potential features includin …
- Apache Iceberg Table Format The Apache Iceberg table format is a specification defining how data files, metadata files, manifests, and snapshots are …
- Apache Iceberg vs Apache Hudi Apache Iceberg and Apache Hudi are both open table formats for cloud lakehouses — Iceberg prioritizes multi-engine inter …
- Apache Iceberg vs Delta Lake Apache Iceberg and Delta Lake are the two dominant open table formats for cloud lakehouses — Iceberg offers superior mul …
- Data Lakehouse A data lakehouse is a modern data architecture that combines the low-cost, scalable storage of a data lake with the reli …
- Hidden Partitioning in Apache Iceberg Hidden partitioning in Apache Iceberg separates the physical partition layout from the logical table schema, allowing th …
- Iceberg Column Mapping Iceberg column mapping decouples the logical column names in the schema from the physical field names in data files usin …
- Iceberg Deletion Vectors Deletion vectors are a Spec v3 enhancement to Apache Iceberg's row-level delete mechanism, replacing positional delete f …
- Iceberg Open Table Format vs. Delta Lake vs. Apache Hudi Apache Iceberg, Delta Lake, and Apache Hudi are the three dominant open table formats competing to be the storage founda …
- Iceberg Sequence Number The Iceberg sequence number is a monotonically increasing integer assigned to each snapshot and each data/delete file, i …
- Iceberg Snapshot References Iceberg snapshot references are named pointers (branches and tags) stored in the table metadata that reference specific …
- Iceberg Sort Order An Iceberg sort order is a table-level specification stored in metadata that defines how data should be physically order …
- Iceberg Table Properties Iceberg table properties are key-value configuration settings stored in the table metadata that control write behavior, …
- Iceberg Table Statistics (Puffin) Iceberg table statistics are advanced column-level metrics — including NDV (number of distinct values) estimates using A …
- Iceberg Views Apache Iceberg Views are named, stored SQL queries managed by the Iceberg catalog that appear as virtual tables to downs …
- Open Table Format Comparison (Iceberg, Delta Lake, Hudi, Paimon) A comprehensive comparison of the four major open table formats — Apache Iceberg, Delta Lake, Apache Hudi, and Apache Pa …
- Partition Evolution in Apache Iceberg Partition evolution in Apache Iceberg lets you change a table's partitioning scheme at any time without rewriting existi …
- Schema Evolution in Apache Iceberg Schema evolution in Apache Iceberg allows you to safely add, drop, rename, reorder, and widen columns in a table without …
- Time Travel in Apache Iceberg Time travel in Apache Iceberg lets you query a table as it existed at any past snapshot or timestamp, enabling reproduci …
- What is Apache Iceberg? Apache Iceberg is an open, high-performance table format for huge analytic datasets stored in data lakes, enabling ACID …
File & Metadata Layer
- Apache Iceberg ORC Format Apache ORC (Optimized Row Columnar) is an alternative columnar storage format supported by Apache Iceberg alongside Parq …
- Apache Iceberg Snapshot An Iceberg snapshot is an immutable, point-in-time view of a table's complete data state, recorded as a manifest list th …
- Apache Parquet and Iceberg Apache Parquet is the default and recommended data file format for Apache Iceberg tables, providing columnar storage, ri …
- Iceberg Avro Metadata Format Apache Avro is the metadata file format used for all Apache Iceberg manifest files and manifest lists, providing schema- …
- Iceberg Data Files Iceberg data files are the immutable columnar files (Parquet, ORC, or Avro) that store the actual table data in object s …
- Iceberg Delete Files Iceberg delete files record row-level deletions without rewriting data files, enabling efficient UPDATE, DELETE, and MER …
- Iceberg Encryption Apache Iceberg supports column-level and file-level encryption through its encryption specification, enabling sensitive …
- Iceberg Equality Deletes Equality delete files in Apache Iceberg record column values identifying rows to be deleted, enabling row-level deletes …
- Iceberg FileIO API The Iceberg FileIO API is an abstraction layer that decouples the Iceberg table format from specific storage system impl …
- Iceberg Manifest File An Iceberg manifest file is an Avro metadata file that tracks a subset of an Iceberg table's data files, recording each …
- Iceberg Manifest List An Iceberg manifest list is a file associated with each snapshot that lists all the manifest files making up that snapsh …
- Iceberg Metadata File The Iceberg metadata file (metadata.json) is the top-level entry point for an Iceberg table, recording the full history …
- Iceberg Positional Deletes Positional delete files in Apache Iceberg record the exact file path and row position of deleted rows, enabling efficien …
- Iceberg Puffin Files Puffin is the Apache Iceberg file format for storing advanced table statistics and indexes beyond the basic min/max boun …
Catalogs
- Apache Gravitino Apache Gravitino is an open-source multi-source metadata hub that provides unified metadata management across heterogene …
- Apache Polaris Catalog Apache Polaris is an open-source implementation of the Apache Iceberg REST Catalog specification, co-created by Dremio a …
- AWS Glue Catalog for Apache Iceberg AWS Glue Data Catalog is Amazon's managed metadata catalog service with native support for Apache Iceberg tables via the …
- Hive Metastore Catalog for Iceberg The Hive Metastore (HMS) is the original Iceberg catalog implementation, using a relational database to store Iceberg ta …
- Iceberg Catalog Migration Iceberg catalog migration moves tables between catalog implementations (HMS to Polaris, Glue to Nessie, JDBC to REST Cat …
- Iceberg JDBC Catalog The Iceberg JDBC Catalog uses any JDBC-compatible relational database (PostgreSQL, MySQL, SQLite) as a persistent metada …
- Iceberg Multi-Catalog Architecture Multi-catalog architectures in Apache Iceberg use multiple catalog instances to achieve environment isolation, domain se …
- Iceberg REST Catalog The Iceberg REST Catalog is a standardized HTTP API specification for Apache Iceberg catalog operations, enabling any en …
- Iceberg REST Catalog API Reference The Apache Iceberg REST Catalog specification defines a standardized HTTP API for catalog operations — namespace managem …
- Project Nessie Project Nessie is an open-source transactional metadata catalog for Apache Iceberg with Git-like branching and merging s …
- Snowflake Open Catalog Snowflake Open Catalog is a managed Apache Polaris service offered by Snowflake that provides a vendor-neutral Iceberg R …
- What is an Iceberg Catalog? An Apache Iceberg catalog is the service responsible for tracking the current metadata file location for each Iceberg ta …
Operations & Optimization
- Copy-on-Write (CoW) in Iceberg Copy-on-Write (CoW) is an Iceberg write mode where UPDATE and DELETE operations rewrite entire affected data files to pr …
- Expire Snapshots in Apache Iceberg Expiring snapshots in Apache Iceberg is the maintenance operation that removes old snapshot metadata (and optionally the …
- Iceberg Bloom Filters Bloom filter indexes in Apache Iceberg enable probabilistic row-level skipping by allowing query engines to determine wi …
- Iceberg Branching and Tagging Iceberg table branches and tags are named references to specific snapshots or independent snapshot chains, enabling Git- …
- Iceberg Concurrent Write Handling Apache Iceberg uses optimistic concurrency control with atomic catalog commits to safely handle multiple simultaneous wr …
- Iceberg Cost Optimization Cost optimization for Apache Iceberg lakehouses targets storage costs (snapshot expiration, compression, tiering), compu …
- Iceberg Data Skipping Data skipping in Apache Iceberg is the multi-level mechanism by which query engines eliminate irrelevant files and row g …
- Iceberg Incremental Reads Iceberg incremental reads enable processing only the new or changed data between two snapshots by using the snapshot dif …
- Iceberg Maintenance Scheduling Production Apache Iceberg maintenance requires scheduling compaction, snapshot expiration, orphan file cleanup, and mani …
- Iceberg Orphan Files Orphan files in Apache Iceberg are data files written to object storage during failed transactions that were never commi …
- Iceberg Performance Tuning Guide A comprehensive guide to optimizing Apache Iceberg query and write performance, covering partition pruning effectiveness …
- Iceberg Predicate Pushdown Predicate pushdown in Apache Iceberg propagates WHERE clause filter conditions from the query layer down through the man …
- Iceberg Rewrite Manifests Rewriting Iceberg manifests is a maintenance operation that consolidates many small manifest files into fewer, larger on …
- Iceberg Table Clustering Table clustering in Apache Iceberg co-locates related rows within the same data files to maximize column statistics sele …
- Iceberg Table Compaction Iceberg compaction is the maintenance process of merging small data files into optimally sized files, applying pending d …
- Iceberg Table Design Best Practices Iceberg table design best practices cover partition strategy, sort order selection, file format and compression choices, …
- Iceberg Table Rollback Rolling back an Apache Iceberg table reverts its current state to a prior snapshot, effectively undoing all writes since …
- Iceberg Upsert (MERGE INTO) Iceberg upsert operations using MERGE INTO enable atomic insert-or-update workflows against Iceberg tables, implementing …
- Iceberg Write Distribution Modes Iceberg write distribution modes control how data is distributed across parallel write tasks before being written to out …
- Merge-on-Read (MoR) in Iceberg Merge-on-Read (MoR) is an Iceberg write strategy where UPDATE and DELETE operations write small delete files instead of …
- Row-Level Deletes in Apache Iceberg Row-level deletes in Apache Iceberg enable precise removal or modification of individual rows within existing data files …
- Small File Problem in Apache Iceberg The small file problem in Apache Iceberg occurs when frequent write transactions generate many small Parquet files, degr …
- Z-Order Clustering in Apache Iceberg Z-Order (or Z-curve) clustering in Apache Iceberg is a multi-dimensional data layout optimization that co-locates rows w …
Engines & Integrations
- Apache Airflow and Apache Iceberg Apache Airflow is the most widely used workflow orchestration platform for Iceberg data pipelines, providing scheduling, …
- Apache Doris and Apache Iceberg Apache Doris is a high-performance real-time analytical database with native Iceberg external catalog support, enabling …
- Apache Flink and Apache Iceberg Apache Flink is the leading stream processing engine for Apache Iceberg, enabling real-time data ingestion with exactly- …
- Apache Kafka and Apache Iceberg Apache Kafka and Apache Iceberg form the backbone of real-time lakehouse pipelines — Kafka provides the event streaming …
- Apache Spark and Apache Iceberg Apache Spark is the most feature-complete query engine for Apache Iceberg, providing full DDL, DML, time travel, stored …
- Apache Superset and Apache Iceberg Apache Superset is the leading open-source business intelligence tool that queries Apache Iceberg tables through SQL con …
- Databricks and Apache Iceberg Databricks supports Apache Iceberg through UniForm (Delta-to-Iceberg automatic metadata generation) and native Iceberg c …
- dbt and Apache Iceberg dbt (data build tool) transforms raw Iceberg table data into clean, tested, documented analytical models using SQL, with …
- Dremio and Apache Iceberg Dremio is an Agentic Lakehouse platform that provides a fully integrated Iceberg experience through its Intelligent Quer …
- DuckDB and Apache Iceberg DuckDB is an embedded analytical database with a native Apache Iceberg extension that enables direct, high-performance S …
- Hive and Apache Iceberg Apache Hive 4.x has native Iceberg support, enabling Hive SQL to read and write Iceberg tables as first-class objects wh …
- Presto and Apache Iceberg PrestoDB is the Meta-maintained fork of the original Presto query engine with an Iceberg connector that supports Iceberg …
- PyIceberg: Python Library for Apache Iceberg PyIceberg is the official Python library for Apache Iceberg, providing a pure-Python client for reading, writing, and ma …
- Snowflake Iceberg Tables Snowflake Iceberg Tables let organizations store Iceberg data in their own object storage (external volumes) while using …
- StarRocks and Apache Iceberg StarRocks is a high-performance OLAP query engine with native Apache Iceberg external table support via its Multi-Catalo …
- Trino and Apache Iceberg Trino (formerly PrestoSQL) is a distributed SQL query engine with native Apache Iceberg support, optimized for interacti …
Agentic & AI
- Agentic Lakehouse An Agentic Lakehouse is a data lakehouse architecture purpose-built for AI agents and autonomous analytics, combining op …
- Iceberg AI Readiness Iceberg AI readiness describes the architectural properties that make Apache Iceberg tables ideal for AI and machine lea …
- Iceberg AI Semantic Layer The AI Semantic Layer on Apache Iceberg translates raw Iceberg table data into AI-understandable business context throug …
- Iceberg Apache Arrow Flight Apache Arrow Flight provides a high-throughput, low-latency RPC protocol for transferring Apache Arrow columnar data fro …
- Iceberg Feature Store Apache Iceberg is used as the offline storage layer in ML feature stores, providing point-in-time correct feature retrie …
- Iceberg LLM Grounding and RAG for Structured Data LLM grounding with Apache Iceberg uses governed, versioned Iceberg tables as the authoritative data source for LLM respo …
- Iceberg Natural Language Analytics Natural language analytics on Apache Iceberg enables business users and AI agents to ask questions in plain English and …
- LangChain and Apache Iceberg LangChain agents can query Apache Iceberg lakehouses using SQL tools and Arrow Flight connections, enabling natural lang …
- MCP and Apache Iceberg Model Context Protocol (MCP) servers for Apache Iceberg enable AI agents and LLMs to discover, query, and reason over Ic …
Cloud-Specific Integrations
- Amazon EMR and Apache Iceberg Amazon EMR (Elastic MapReduce) is AWS's managed Spark and Flink cluster service that supports Apache Iceberg as a first- …
- Amazon S3 Tables for Apache Iceberg Amazon S3 Tables is an AWS managed service that provides Apache Iceberg table storage and catalog directly within Amazon …
- AWS Athena and Apache Iceberg Amazon Athena is a serverless SQL query engine with native Apache Iceberg support via the AWS Glue Data Catalog, enablin …
- BigQuery and Apache Iceberg Google BigQuery supports Apache Iceberg tables through BigLake managed tables and BigLake Metastore, enabling BigQuery S …
- Google Cloud and Apache Iceberg Google Cloud's Apache Iceberg stack integrates BigQuery, Cloud Storage, BigLake Metastore, and Cloud Dataplex to provide …
- Microsoft Fabric and Apache Iceberg Microsoft Fabric supports Apache Iceberg tables through OneLake's open format integration and mirrored Fabric tables, en …
Governance & Security
- Iceberg Access Control Patterns Iceberg access control is implemented at the catalog layer through the Iceberg REST Catalog RBAC model, providing namesp …
- Iceberg Audit Logging Iceberg audit logging captures a complete record of all catalog interactions, table reads, write commits, schema changes …
- Iceberg Data Lineage Iceberg data lineage is the ability to trace the origin, transformation history, and downstream consumption of data in I …
- Iceberg Data Masking Data masking in Apache Iceberg protects sensitive column values from unauthorized consumers by applying masking function …
- Iceberg Multi-Tenancy Patterns Multi-tenancy in Apache Iceberg isolates multiple tenants, teams, or environments in a shared lakehouse using namespace …
Patterns & Architecture
- Iceberg CDC (Change Data Capture) CDC with Apache Iceberg enables real-time synchronization of operational database changes (inserts, updates, deletes) in …
- Iceberg Data Mesh Architecture A data mesh on Apache Iceberg uses Iceberg tables as the storage standard for domain-owned data products, with the Icebe …
- Iceberg Lakehouse Federation Iceberg lakehouse federation enables querying Iceberg tables across multiple catalogs, cloud environments, and storage p …
- Iceberg Streaming Ingestion Iceberg streaming ingestion is the pattern of continuously writing data from event streams, Kafka topics, and CDC feeds …
- Iceberg Table Migration from Hive Migrating from Apache Hive tables to Apache Iceberg converts existing Parquet files into Iceberg-managed tables with ful …
- Medallion Architecture with Apache Iceberg The Medallion Architecture (Bronze/Silver/Gold) is a multi-layer data organization pattern where raw data flows through …
- Write-Audit-Publish (WAP) Pattern The Write-Audit-Publish (WAP) pattern is a data pipeline quality assurance workflow using Apache Iceberg branches to wri …