File & Metadata Layer Last updated: May 14, 2026


Apache Parquet and Iceberg

Apache Parquet is the default and overwhelmingly dominant data file format for Apache Iceberg tables. While Iceberg technically supports ORC and Avro as well, Parquet is the de facto standard in the lakehouse ecosystem, and understanding how Parquet and Iceberg work together is essential for building high-performance lakehouse architectures.

Parquet provides columnar storage with rich compression and column-level statistics. Iceberg provides a metadata layer above Parquet with manifest-level statistics and snapshot management. Together, they deliver a two-level data skipping pipeline that can reduce query scan sizes by orders of magnitude.

Apache Parquet: A Quick Overview

Apache Parquet is an open-source, column-oriented data file format designed for efficient analytical processing. Key properties:

  - Columnar layout: values for each column are stored contiguously, so queries read only the columns they reference.
  - Row groups: each file is divided into horizontal row groups, each with its own column chunks and statistics.
  - Footer metadata: the schema, row group offsets, and per-column min/max statistics live in the file footer.
  - Compression and encoding: per-column codecs (Snappy, Zstd, Gzip) combined with encodings such as dictionary and run-length.

How Parquet and Iceberg Complement Each Other

Iceberg and Parquet create a two-level data skipping pipeline:

Level 1: Iceberg Manifest-Level Skipping

Iceberg stores column-level min/max statistics for each entire Parquet file in the manifest file. The query engine can skip entire files before opening them if the column statistics prove no matching rows can exist.
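The manifest-level check can be sketched in plain Python. The file names and bounds below are hypothetical; a real manifest stores serialized `lower_bounds` and `upper_bounds` per column, but the pruning logic is the same interval test:

```python
from datetime import date

# Hypothetical manifest entries: per-file min/max bounds for event_date,
# standing in for Iceberg's lower_bounds / upper_bounds fields.
manifest = [
    {"path": "f1.parquet", "min": date(2026, 5, 13), "max": date(2026, 5, 13)},
    {"path": "f2.parquet", "min": date(2026, 5, 14), "max": date(2026, 5, 14)},
    {"path": "f3.parquet", "min": date(2026, 5, 14), "max": date(2026, 5, 15)},
]

def files_to_scan(manifest, value):
    """Keep only files whose [min, max] range could contain `value`."""
    return [f["path"] for f in manifest if f["min"] <= value <= f["max"]]

# f1's bounds exclude the predicate value, so it is never opened.
print(files_to_scan(manifest, date(2026, 5, 14)))  # ['f2.parquet', 'f3.parquet']
```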

Level 2: Parquet Row-Group-Level Skipping

Within a Parquet file that wasn’t skipped at the manifest level, the Parquet reader applies row group filtering using the per-row-group statistics embedded in the Parquet file footer. This skips row groups within a file before reading their pages.

Result: Two-Level Pruning

For a well-clustered table, a query like WHERE event_date = '2026-05-14' might:

  1. Skip 99.9% of files via Iceberg manifest statistics.
  2. Within the remaining files, skip most row groups via Parquet row group statistics.
  3. Read only the matching pages within non-skipped row groups.

This two-level approach makes well-designed Iceberg/Parquet tables competitive with or superior to columnar data warehouses for analytical query performance.
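The three steps above can be combined into a toy end-to-end sketch. The files, bounds, and single integer column below are made up purely to show how the two levels compose:

```python
# Toy two-level pruning: each file carries manifest-level bounds, and each
# file contains row groups with their own footer-level bounds.
files = [
    {"bounds": (0, 99),    "row_groups": [(0, 49), (50, 99)]},
    {"bounds": (100, 199), "row_groups": [(100, 149), (150, 199)]},
    {"bounds": (200, 299), "row_groups": [(200, 249), (250, 299)]},
]

def plan_scan(files, value):
    """Return (files_opened, row_groups_read) for an equality predicate."""
    opened, groups = 0, 0
    for f in files:
        lo, hi = f["bounds"]
        if not (lo <= value <= hi):
            continue                      # level 1: manifest-level skip
        opened += 1
        for g_lo, g_hi in f["row_groups"]:
            if g_lo <= value <= g_hi:     # level 2: row-group-level skip
                groups += 1
    return opened, groups

print(plan_scan(files, 160))  # (1, 1): one file opened, one row group read
```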

Optimal Parquet File Size

For Iceberg tables, the recommended Parquet file size is 128 MB to 512 MB per file: large enough to amortize per-file open and footer-read overhead, yet small enough to parallelize scans and keep compaction manageable.

File size is controlled by the table's target file size setting, compaction jobs, and write batch sizes.
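With a SQL engine such as Spark, for example, the target is typically set via the Iceberg table property write.target-file-size-bytes; the statement below is a sketch (the table name is illustrative, and 536870912 bytes = 512 MB):

```sql
-- Aim for ~512 MB data files on future writes and compactions
ALTER TABLE db.events
SET TBLPROPERTIES ('write.target-file-size-bytes' = '536870912');
```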

Parquet Compression

Recommended compression for Iceberg Parquet files:

Codec        | Speed   | Ratio    | Use Case
Snappy       | Fast    | Moderate | General purpose, default
Zstd         | Fast    | High     | Best balance of compression and speed
Gzip         | Slow    | Highest  | Archival, rarely written
Uncompressed | Fastest | None     | Debugging only

Zstd is increasingly the recommended default for production Iceberg tables.

Parquet Column Statistics and Iceberg

When Iceberg writes a data file, it reads column statistics from the Parquet file footer and stores them in the manifest entry for that file. This is how Iceberg populates the lower_bounds and upper_bounds fields used for data skipping at the manifest level.

The quality of data skipping depends on data clustering — how well the data within each file is sorted or ordered. A file with event_date values randomly distributed between 2020 and 2026 has very wide min/max bounds and provides poor skipping. A file with all event_date values within a single day provides tight bounds and excellent skipping.

This is why clustering and sorting (see Z-Order Clustering) is so important for Iceberg query performance.
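A small simulation (hypothetical data, one integer "day" column) shows how layout alone changes how many files a point query must scan, even though both layouts hold identical rows:

```python
import random

def file_bounds(values, files=10):
    """Split `values` into equal-size files and return per-file (min, max)."""
    n = len(values) // files
    chunks = [values[i * n:(i + 1) * n] for i in range(files)]
    return [(min(c), max(c)) for c in chunks]

def files_scanned(bounds, value):
    """Count files whose bounds could contain `value` (cannot be skipped)."""
    return sum(1 for lo, hi in bounds if lo <= value <= hi)

days = list(range(365))
random.seed(0)

clustered = file_bounds(sorted(days * 100))  # data sorted by day: tight bounds
shuffled = days * 100
random.shuffle(shuffled)
scattered = file_bounds(shuffled)            # random layout: wide bounds

print(files_scanned(clustered, 42))  # tight bounds: only 1 file scanned
print(files_scanned(scattered, 42))  # wide bounds: all 10 files scanned
```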

Parquet and Iceberg in Practice

# PyIceberg: appending Parquet data to an Iceberg table
from pyiceberg.catalog import load_catalog
import pyarrow as pa

catalog = load_catalog("my_catalog")
table = catalog.load_table("db.events")

# Parquet compression is controlled by the Iceberg table property
# write.parquet.compression-codec (e.g. "zstd"), set on the table
# rather than passed per write.
table.append(
    pa.table({"event_id": [...], "event_time": [...]})
)

📚 Go Deeper on Apache Iceberg

Alex Merced has authored three hands-on books covering Apache Iceberg, the Agentic Lakehouse, and modern data architecture. Pick up a copy to master the full ecosystem.

← Back to Iceberg Knowledge Base