PyIceberg: Python Library for Apache Iceberg
PyIceberg is the official Python client library for Apache Iceberg, maintained as part of the Apache Iceberg project. It provides a pure-Python API for:
- Connecting to Iceberg catalogs (REST, Hive, Glue, SQL, DynamoDB; Nessie via its Iceberg REST endpoint)
- Reading Iceberg tables as PyArrow tables, Pandas DataFrames, or Ray datasets
- Writing data to Iceberg tables
- Managing table schema, partitioning, and properties
- Running SQL queries against Iceberg tables (via DuckDB integration)
PyIceberg is the right choice for Python data engineering workflows that don't require Spark's distributed processing. It is significantly lighter-weight, faster to set up, and more idiomatic for Python developers.
Installation
pip install "pyiceberg[s3fs,glue]" # AWS with S3 storage
pip install "pyiceberg[adlfs,azure]" # Azure
pip install "pyiceberg[gcs]" # GCP
pip install "pyiceberg[duckdb]" # Local with DuckDB SQL
Connecting to a Catalog
REST Catalog (Apache Polaris, Dremio Open Catalog, AWS Glue REST)
from pyiceberg.catalog import load_catalog
catalog = load_catalog(
    "my_catalog",
    **{
        "type": "rest",
        "uri": "https://my-catalog.example.com",
        "credential": "client-id:client-secret",
    }
)
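If the catalog issues tokens out of band, a pre-issued bearer token can be passed instead of client credentials. A minimal variation; the token value is a placeholder:

catalog = load_catalog(
    "my_catalog",
    **{
        "type": "rest",
        "uri": "https://my-catalog.example.com",
        "token": "<bearer-token>",  # pre-issued OAuth2 bearer token
    }
)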
AWS Glue Catalog
catalog = load_catalog(
    "glue",
    **{
        "type": "glue",
        "glue.region": "us-east-1",
    }
)
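By default the Glue catalog picks up credentials from the standard boto3 credential chain. They can also be passed explicitly; a sketch using the glue.* property names from PyIceberg's configuration docs, with placeholder values:

catalog = load_catalog(
    "glue",
    **{
        "type": "glue",
        "glue.region": "us-east-1",
        "glue.access-key-id": "<ACCESS_KEY_ID>",          # placeholder, not a real credential
        "glue.secret-access-key": "<SECRET_ACCESS_KEY>",  # placeholder, not a real credential
    }
)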
Local / Development (SQL Catalog with SQLite)
catalog = load_catalog(
    "local",
    **{
        "type": "sql",
        "uri": "sqlite:///local_catalog.db",
        "warehouse": "file:///tmp/iceberg-warehouse",
    }
)
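Catalog properties don't have to live in code. PyIceberg also resolves them from a .pyiceberg.yaml configuration file (in the home directory or $PYICEBERG_HOME) or from PYICEBERG_CATALOG__* environment variables (e.g. PYICEBERG_CATALOG__MY_CATALOG__URI), so the catalog can be loaded by name alone:

from pyiceberg.catalog import load_catalog

# Properties are resolved from .pyiceberg.yaml or environment variables
catalog = load_catalog("my_catalog")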
Reading Iceberg Tables
# Load a table
table = catalog.load_table("db.orders")
# Full table scan → PyArrow Table
arrow_table = table.scan().to_arrow()
# Convert to Pandas
df = arrow_table.to_pandas()
# Filter pushdown (predicates pushed to Iceberg manifest scanning)
from pyiceberg.expressions import GreaterThanOrEqual, LessThan, And
filtered = table.scan(
    row_filter=And(
        GreaterThanOrEqual("order_date", "2026-01-01"),
        LessThan("order_date", "2026-06-01"),
    ),
    selected_fields=("order_id", "customer_id", "total"),
).to_arrow()
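row_filter also accepts a SQL-like string predicate, which PyIceberg parses into the same expression tree — convenient for dynamically built filters:

# Equivalent filter expressed as a string predicate
filtered = table.scan(
    row_filter="order_date >= '2026-01-01' and order_date < '2026-06-01'",
    selected_fields=("order_id", "customer_id", "total"),
).to_arrow()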
Writing Data
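Writes require an existing table whose schema matches the incoming Arrow data. If the table doesn't exist yet, it can be created through the catalog first; a minimal sketch, assuming the db namespace already exists:

from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, LongType, DoubleType, StringType

schema = Schema(
    NestedField(field_id=1, name="order_id", field_type=LongType(), required=True),
    NestedField(field_id=2, name="customer_id", field_type=LongType(), required=False),
    NestedField(field_id=3, name="total", field_type=DoubleType(), required=False),
    NestedField(field_id=4, name="order_date", field_type=StringType(), required=False),
)
table = catalog.create_table("db.orders", schema=schema)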
import pyarrow as pa
# Append new data
new_data = pa.table({
    "order_id": [1001, 1002, 1003],
    "customer_id": [42, 17, 99],
    "total": [150.00, 289.99, 44.50],
    "order_date": ["2026-05-14", "2026-05-14", "2026-05-14"],
})
table.append(new_data)
# Overwrite rows matching a filter (without overwrite_filter, the whole table is replaced)
from pyiceberg.expressions import EqualTo
table.overwrite(new_data, overwrite_filter=EqualTo("order_date", "2026-05-14"))
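Newer PyIceberg releases also support row-level deletes through a filter; a sketch reusing the EqualTo import from the overwrite example (check the release notes for your version):

# Delete rows matching a predicate
table.delete(delete_filter=EqualTo("order_id", 1002))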
Time Travel Queries
# Load a specific snapshot by ID
snapshot = table.snapshot_by_id(8027658604211071520)
scan = table.scan(snapshot_id=snapshot.snapshot_id)
historical_data = scan.to_arrow()
# Load by timestamp
from datetime import datetime
snap = table.snapshot_as_of_timestamp(
    int(datetime(2026, 1, 1).timestamp() * 1000)  # milliseconds
)
historical = table.scan(snapshot_id=snap.snapshot_id).to_arrow()
SQL via DuckDB Integration
PyIceberg integrates with DuckDB for SQL-based querying:
import duckdb
# Register the Iceberg table with DuckDB
conn = duckdb.connect()
table = catalog.load_table("db.orders")
# Read via PyIceberg to Arrow, then query with DuckDB
arrow_table = table.scan().to_arrow()
conn.register("orders", arrow_table)
result = conn.execute("""
SELECT customer_id, SUM(total) as revenue
FROM orders
WHERE order_date >= '2026-01-01'
GROUP BY customer_id
ORDER BY revenue DESC
LIMIT 10
""").fetchdf()
Schema and Metadata Operations
# Inspect table schema
print(table.schema())
# List all snapshots
for snap in table.snapshots():
    print(snap.snapshot_id, snap.timestamp_ms, snap.summary.operation)
# Inspect data files
for task in table.scan().plan_files():
    print(task.file.file_path, task.file.record_count)
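Schema changes go through the update_schema API, which commits atomically when the context manager exits. A minimal sketch adding a column (the note column is illustrative):

from pyiceberg.types import StringType

# Evolve the schema: the change is committed as a single transaction
with table.update_schema() as update:
    update.add_column("note", StringType())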
PyIceberg and the Agentic Lakehouse
PyIceberg is the natural integration point for AI agents and LLM-driven data workflows:
- MCP servers: AI agent frameworks can use PyIceberg to inspect Iceberg table schemas, run queries, and return results to LLMs.
- LangChain tools: PyIceberg can be wrapped as a LangChain tool for natural-language-to-Iceberg-query workflows (see the sketch after this list).
- Data pipeline automation: Python-based orchestration frameworks (Airflow, Prefect) use PyIceberg for catalog management without Spark dependencies.
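As one illustration of the LangChain point above, a PyIceberg helper can be exposed as a tool in a few lines. A minimal sketch: describe_iceberg_table is a hypothetical helper name, and catalog is assumed to be configured as shown earlier.

from langchain_core.tools import tool

@tool
def describe_iceberg_table(identifier: str) -> str:
    """Return the schema of an Iceberg table, e.g. 'db.orders'."""
    table = catalog.load_table(identifier)
    return str(table.schema())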
For Python-first AI and data engineering teams, PyIceberg is the fastest path to Iceberg integration.