Last updated: May 14, 2026

Iceberg Table Statistics (Puffin)

Iceberg table statistics are advanced column-level metrics — including NDV (number of distinct values) estimates using Apache DataSketches — stored in Puffin files and used by cost-based query optimizers to improve join ordering, cardinality estimation, and query planning accuracy.



Apache Iceberg supports two levels of statistics for query optimization:

  1. Per-file manifest statistics (available in all tables): min/max bounds, null counts, value counts per column per data file — stored in manifest entries. Used for data skipping.

  2. Table-level statistics (via Puffin files): advanced aggregate statistics across the entire table — NDV estimates, histograms, bloom filter indexes — stored in Puffin files attached to snapshots. Used for cost-based query planning.

This page covers the table-level statistics layer — the more advanced layer that requires explicit computation.
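Before diving in, the contrast between the two layers can be made concrete with a toy example, assuming a table whose customer_id data lives in just two files (all names and values here are made up for illustration):

```python
file_a = [101, 105, 105, 220]   # customer_id values in data file A
file_b = [220, 300, 301]        # customer_id values in data file B

# Layer 1: per-file stats, as recorded in manifest entries.
# A filter like customer_id = 500 falls outside both min/max ranges,
# so the planner can skip reading both files entirely.
def manifest_stats(values):
    return {"min": min(values), "max": max(values), "value_count": len(values)}

print("file A:", manifest_stats(file_a))  # {'min': 101, 'max': 220, 'value_count': 4}
print("file B:", manifest_stats(file_b))  # {'min': 220, 'max': 301, 'value_count': 3}

# Layer 2: table-level NDV across all files (computed exactly here;
# real tables use a probabilistic sketch stored in a Puffin file).
ndv = len(set(file_a) | set(file_b))
print("NDV(customer_id) =", ndv)  # 5 distinct customers
```

Layer 1 answers "which files can this query skip?"; layer 2 answers "how big will this query's intermediate results be?" — two different questions that need two different kinds of statistics.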

Why Table-Level Statistics Matter

Per-file min/max statistics tell query engines which files to skip. Table-level statistics tell query engines how to plan the query, specifically how to estimate the cardinality of operations: how many rows a join will produce, how selective a filter is, and how many groups an aggregation will output.

Statistics Types

NDV (Number of Distinct Values)

The most important table statistic. Answers: “How many unique values does customer_id have across the entire table?”

Stored as an Apache DataSketches Theta sketch in a Puffin file blob (blob type apache-datasketches-theta-v1, the NDV blob type defined by the Puffin spec). Theta sketches are probabilistic data structures that approximate NDV within a tunable error bound (roughly 3% relative error with typical settings) while using very compact memory.

NDV(customer_id) ≈ 1,250,000 distinct customers
NDV(status) ≈ 5 (pending, processing, shipped, delivered, cancelled)
NDV(product_id) ≈ 48,320 distinct products
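To see how a compact probabilistic structure can count distinct values, here is a minimal K-Minimum-Values (KMV) sketch in pure Python. KMV belongs to the same family as the Theta sketch; this class is purely illustrative and is not part of any Iceberg or DataSketches API:

```python
import hashlib
import heapq

class KMVSketch:
    """K-Minimum-Values distinct-count sketch (illustrative only;
    Iceberg stores Apache DataSketches Theta sketches)."""

    def __init__(self, k: int = 1024):
        self.k = k
        self.heap: list[float] = []     # max-heap (negated) of the k smallest hashes
        self.members: set[float] = set()

    def update(self, value) -> None:
        digest = hashlib.sha256(str(value).encode()).digest()
        x = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
        if x in self.members:
            return                      # duplicate value, already counted
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -x)
            self.members.add(x)
        elif x < -self.heap[0]:
            evicted = -heapq.heapreplace(self.heap, -x)
            self.members.discard(evicted)
            self.members.add(x)

    def estimate(self) -> float:
        if len(self.heap) < self.k:
            return float(len(self.heap))    # exact below k distinct values
        return (self.k - 1) / -self.heap[0] # classic KMV estimator

sketch = KMVSketch(k=1024)
for i in range(100_000):
    sketch.update(f"customer-{i}")
    sketch.update(f"customer-{i}")  # duplicates do not inflate the estimate

print(f"true NDV = 100,000; estimated NDV ~= {sketch.estimate():,.0f}")
```

The sketch keeps only the k smallest hash values ever seen (a few kilobytes), yet with k = 1024 the estimate typically lands within about 3% of the true count, matching the error bound quoted above.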

A query planner uses these estimates to choose join order, pick broadcast versus shuffle joins, size hash tables and memory grants, and estimate the selectivity of group-bys and distinct aggregations.
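As a worked example of why NDV drives planning, here is the textbook equi-join cardinality estimate used by cost-based optimizers. This is a standard formula, not something Iceberg-specific, and real engines refine it considerably:

```python
# |A JOIN B ON key| ~= (|A| * |B|) / max(NDV_A(key), NDV_B(key))
def join_cardinality(rows_a: int, rows_b: int, ndv_a: int, ndv_b: int) -> float:
    return rows_a * rows_b / max(ndv_a, ndv_b)

# orders (100M rows) joined to customers (1.25M rows) on customer_id,
# using the NDV(customer_id) ~= 1,250,000 estimate from above
est = join_cardinality(100_000_000, 1_250_000, 1_250_000, 1_250_000)
print(f"{est:,.0f}")  # 100,000,000 -- roughly one customer row per order
```

Without the NDV estimate the planner has to guess this number, and a guess that is off by orders of magnitude leads to the wrong join order or an undersized hash table.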

Computing Statistics

Statistics must be explicitly computed — they don’t update automatically on each write.

Apache Spark

-- Compute NDV statistics for all columns
ANALYZE TABLE db.orders COMPUTE STATISTICS FOR ALL COLUMNS;

-- Compute for specific columns only
ANALYZE TABLE db.orders COMPUTE STATISTICS FOR COLUMNS customer_id, product_id, order_date;

-- Iceberg 1.7+ also provides a Spark procedure that writes NDV
-- Theta sketches directly to a Puffin file:
CALL my_catalog.system.compute_table_stats(table => 'db.orders');

PyIceberg (programmatic)

from pyiceberg.catalog import load_catalog

catalog = load_catalog("my_catalog", **{...})
table = catalog.load_table("db.orders")

# PyIceberg does not compute NDV sketches itself; statistics
# computation requires a compute engine such as Spark

Dremio

Dremio’s Intelligent Query Engine can compute and use Iceberg statistics automatically as part of its cost-based optimization. Statistics can be refreshed via:

-- Dremio: refresh table metadata, including statistics
ALTER TABLE db.orders REFRESH METADATA;

Inspecting Statistics

-- Spark: list snapshots and their summaries via the metadata tables
SELECT snapshot_id, committed_at, summary
FROM db.orders.snapshots;

-- The statistics files themselves are recorded in the table's
-- metadata JSON under the "statistics" field; each entry points at
-- a Puffin file and the snapshot it was computed for.
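The metadata JSON can also be inspected directly. The sketch below parses a trimmed example of the "statistics" field as laid out in the Iceberg table spec; the paths, ids, and values are made up for illustration:

```python
import json

metadata = json.loads("""
{
  "current-snapshot-id": 7891,
  "statistics": [
    {
      "snapshot-id": 7891,
      "statistics-path": "s3://warehouse/db/orders/metadata/20260514-stats.puffin",
      "file-size-in-bytes": 43125,
      "blob-metadata": [
        {
          "type": "apache-datasketches-theta-v1",
          "snapshot-id": 7891,
          "fields": [1],
          "properties": {"ndv": "1250000"}
        }
      ]
    }
  ]
}
""")

# Locate the statistics file attached to the current snapshot...
current = metadata["current-snapshot-id"]
stats_file = next(s for s in metadata["statistics"] if s["snapshot-id"] == current)
print("Puffin file:", stats_file["statistics-path"])

# ...and read the NDV estimate recorded for field id 1 (customer_id)
for blob in stats_file["blob-metadata"]:
    print(blob["type"], "fields", blob["fields"], "ndv =", blob["properties"]["ndv"])
```

Note that the metadata JSON only carries pointers and blob summaries (including the serialized NDV as a property); the sketches themselves live in the referenced Puffin file.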

Statistics Lifecycle

Statistics are tied to a specific snapshot. When the table is updated, new snapshots are created without new statistics: the existing Puffin file still describes the older snapshot it was computed for, so its estimates grow stale as the data changes.

Best practice: recompute statistics after large ingests, compactions, or any change that significantly shifts column cardinalities, and on a regular schedule for continuously written tables.

Statistics and the Query Planning Pipeline

The full statistics hierarchy a query planner uses:

1. Table-level statistics (Puffin NDV)
   → Join ordering, memory allocation, subquery planning

2. Manifest-level statistics (min/max per file)
   → Partition pruning, file-level data skipping

3. Parquet row group statistics (min/max per row group)
   → Sub-file row group skipping

4. Actual data read
   → Row-level predicate evaluation

Each layer provides a progressively finer-grained filter, ensuring queries touch the minimum amount of data at every stage.
