Last updated: May 14, 2026

Iceberg Puffin Files

Puffin is the Apache Iceberg file format for storing advanced table statistics and indexes beyond the basic min/max bounds in manifest files, including NDV (number of distinct values) sketches, theta sketches, and bloom filters that enable more accurate query planning.

Puffin is the Apache Iceberg file format for storing advanced table-level statistics and indexes that go beyond the column min/max bounds available in manifest files. Puffin files attach supplementary statistical metadata to Iceberg table snapshots, enabling query planners to make better cost-based optimization decisions — such as accurate join ordering, smarter partition elimination, and bloom-filter-based row skipping.

The name “Puffin” is deliberately playful: it follows Iceberg’s arctic theme and refers to the puffin, a seabird of northern waters.

Why Puffin Exists

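Manifest files store per-file column statistics: min/max values, null counts, value counts. These are powerful for data skipping but have limitations: they cannot tell a planner how many distinct values a column holds, and min/max bounds cannot rule out a file for a point lookup when the value falls inside that file’s range.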

Puffin adds a dedicated file format to attach these richer statistics to snapshots — separate from manifests, and extensible for future statistics types.

Puffin File Structure

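A Puffin file is a binary format with a 4-byte magic number (PFA1) at the start, a sequence of blobs (the serialized sketches or indexes), and a footer whose JSON payload describes every blob in the file.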

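Each blob in a Puffin file has a metadata entry in the footer that records its type, the field IDs of the columns it was computed over, the snapshot ID and sequence number it was computed against, its offset and length within the file, and optionally a compression codec and free-form properties.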
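To make the layout concrete, here is a minimal Python sketch that writes and reads a Puffin-style file: magic bytes, blob payloads, then a footer made of magic, a JSON payload, a 4-byte little-endian payload size, 4 flag bytes, and magic again. It assumes an uncompressed footer and uncompressed blobs, and the function names write_puffin and read_puffin are hypothetical helpers for illustration, not part of any Iceberg library.

```python
import json
import struct

MAGIC = b"PFA1"  # 4-byte magic used at the file start and around the footer


def write_puffin(blobs):
    """blobs: list of (metadata_dict, payload_bytes). Returns the file bytes."""
    out = bytearray(MAGIC)
    entries = []
    for meta, payload in blobs:
        # Record where this blob lives so a reader can seek straight to it.
        entries.append(dict(meta, offset=len(out), length=len(payload)))
        out += payload
    footer_payload = json.dumps({"blobs": entries}).encode("utf-8")
    out += MAGIC
    out += footer_payload
    out += struct.pack("<i", len(footer_payload))  # payload size, little-endian
    out += bytes(4)  # flags: all zero means the footer payload is uncompressed
    out += MAGIC
    return bytes(out)


def read_puffin(data):
    """Returns a list of (blob_metadata, payload_bytes), read via the footer."""
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Puffin file")
    flags = data[-8:-4]
    if flags[0] & 1:
        raise NotImplementedError("compressed footers are not handled in this sketch")
    (payload_size,) = struct.unpack("<i", data[-12:-8])
    payload_start = len(data) - 12 - payload_size
    if data[payload_start - 4:payload_start] != MAGIC:
        raise ValueError("footer magic missing")
    footer = json.loads(data[payload_start:len(data) - 12])
    return [
        (blob, data[blob["offset"]:blob["offset"] + blob["length"]])
        for blob in footer["blobs"]
    ]
```

Because the footer sits at the end and carries every blob's offset and length, a reader can fetch the footer first and then read only the blobs it needs.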

Supported Statistics Types

Apache DataSketches Theta Sketch (NDV)

Estimates the number of distinct values (NDV) for a column using the Theta sketch algorithm from the Apache DataSketches library. NDV is critical for join cardinality estimation.

blob type: "apache-datasketches-theta-v1"
→ answers: "approximately how many distinct values does customer_id have?"
→ use: join ordering, GROUP BY cardinality estimation
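To give a feel for how a sketch can estimate NDV from a bounded amount of state, the following Python example implements a simplified K-Minimum-Values (KMV) estimator, the idea that Theta sketches generalize. This is an illustrative stand-in, not the Apache DataSketches implementation or its serialized format; the class name KMVSketch and the choice of SHA-256 as the hash are assumptions made for the example.

```python
import hashlib
import heapq


class KMVSketch:
    """Keeps the k smallest hash values seen; their density estimates NDV."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # max-heap of retained hashes (stored negated)
        self._members = set()  # retained hashes, for duplicate detection

    @staticmethod
    def _hash(value):
        # Map the value to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, value):
        h = self._hash(value)
        if h in self._members:
            return  # already counted
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Smaller than the largest retained hash: swap it in.
            evicted = -heapq.heapreplace(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct values: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


sketch = KMVSketch(k=1024)
for i in range(50_000):
    sketch.update(i % 10_000)  # 10,000 distinct values, each seen 5 times
approx_ndv = sketch.estimate()
```

The sketch never stores more than k hashes no matter how many rows it sees, which is why this family of structures is small enough to keep alongside table metadata.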

Apache DataSketches HLL Sketch

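The HyperLogLog (HLL) sketch is another NDV estimation algorithm with different accuracy/size tradeoffs. The blob type below is not among the blob types standardized in the Puffin specification, so treat it as engine-specific or proposed and check the spec before relying on it.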

blob type: "apache-datasketches-hll-v1"

Bloom Filter Index (Future / In Progress)

File-level bloom filters stored in Puffin would allow the engine to answer “does this data file contain a row where user_id = 12345?” with a few hash probes. A bloom filter can return false positives but never false negatives, so any file whose filter reports the value as absent can be skipped safely.
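The file-skipping idea can be sketched in a few lines of Python. This is a toy bloom filter for illustration only: the class, the files_to_scan helper, the file names, and the sizing parameters are all hypothetical, and it does not reflect any on-disk format that Iceberg may adopt.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions by salting the hash input with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value!r}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def files_to_scan(filters_by_file, value):
    """Keep only the files whose filter cannot rule the value out."""
    return [path for path, bf in filters_by_file.items() if bf.might_contain(value)]


# Hypothetical data files, each with a filter over its user_id values.
filters = {"file-a.parquet": BloomFilter(), "file-b.parquet": BloomFilter()}
for user_id in range(0, 100):
    filters["file-a.parquet"].add(user_id)
for user_id in range(1000, 1100):
    filters["file-b.parquet"].add(user_id)

candidates = files_to_scan(filters, 42)
```

A file that really contains the value is always kept; a file that does not is usually, but not always, dropped, and that rate depends on how the filter is sized relative to the number of values.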

Puffin Files and the Snapshot

Puffin files are associated with a snapshot through the statistics list in the table metadata. Each entry names the snapshot it describes, the location and size of the Puffin file, and the metadata of the blobs it contains:

{
  "statistics": [
    {
      "snapshot-id": 8027658604211071520,
      "statistics-path": "s3://bucket/warehouse/db/orders/metadata/snap-8027...puffin",
      "file-size-in-bytes": 16384,
      "file-footer-size-in-bytes": 512,
      "blob-metadata": [...]
    }
  ]
}
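As a small illustration of how a reader might use this metadata, the Python sketch below finds the statistics files recorded for a given snapshot. It assumes the table metadata has already been loaded into a dictionary with a "statistics" list shaped like the example above; the helper name and the sample path are hypothetical.

```python
def statistics_files_for(table_metadata, snapshot_id):
    """Return the Puffin file paths recorded for one snapshot."""
    return [
        entry["statistics-path"]
        for entry in table_metadata.get("statistics", [])
        if entry["snapshot-id"] == snapshot_id
    ]


# Hypothetical metadata with one statistics entry.
example_metadata = {
    "statistics": [
        {
            "snapshot-id": 8027658604211071520,
            "statistics-path": "s3://example-bucket/example/stats.puffin",
            "file-size-in-bytes": 16384,
            "file-footer-size-in-bytes": 512,
            "blob-metadata": [],
        }
    ]
}

paths = statistics_files_for(example_metadata, 8027658604211071520)
```

A snapshot with no entry simply yields an empty list, which a planner would treat as "no advanced statistics available" and fall back to manifest-level bounds.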

When a snapshot is expired, its associated Puffin files are also cleaned up.

Generating Puffin Statistics

Puffin statistics must be explicitly computed; they are not generated during normal writes. In Spark, recent Iceberg releases provide the compute_table_stats procedure for this:

-- Compute NDV sketches for the table and store them in a Puffin file
-- (my_catalog is a placeholder for your Iceberg catalog name)
CALL my_catalog.system.compute_table_stats(table => 'db.orders');

-- The procedure returns the path of the statistics file it wrote.
-- The file is also recorded in the statistics list of the table metadata JSON.

In Dremio Cloud and Enterprise, statistics collection can be triggered via the UI or API and is used by the Intelligent Query Engine’s cost-based optimizer.

Puffin and Query Planning

Engines that support Puffin statistics read them during query planning, for example using NDV estimates to choose a join order and to size GROUP BY aggregations.
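One textbook way a cost-based optimizer can use NDV is to estimate the output size of an equi-join as the product of the input row counts divided by the larger of the two distinct-value counts. The Python sketch below shows that formula; it is a generic estimator under the usual uniformity assumption, not the formula of any particular engine, and the table sizes are made up for illustration.

```python
def estimate_join_rows(left_rows, left_ndv, right_rows, right_ndv):
    """Classic equi-join estimate: |L| * |R| / max(ndv_L, ndv_R)."""
    return left_rows * right_rows / max(left_ndv, right_ndv, 1)


# Hypothetical inputs: orders joined to customers on customer_id.
orders_rows, orders_ndv = 1_000_000, 50_000
customers_rows, customers_ndv = 50_000, 50_000

estimated = estimate_join_rows(orders_rows, orders_ndv, customers_rows, customers_ndv)
```

With these numbers the estimate equals the size of the orders table, which matches the intuition that each order joins to exactly one customer. Without an NDV figure the planner would have to guess the denominator, and a bad guess can flip the chosen join order.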

Puffin is an evolving area of the Iceberg spec — expect bloom filter support, histogram statistics, and multi-column statistics to emerge as the ecosystem matures.
