Iceberg Performance Tuning Guide
Iceberg performance problems typically fall into one of three categories: slow reads (poor data skipping or excessive file count), slow writes (suboptimal distribution or configuration), or slow metadata operations (excessive manifest count or large metadata files). This guide addresses all three.
Diagnosing Performance Issues
Step 1: Understand Your Query Pattern
Before tuning, identify what type of query is slow:
- Point lookup: `WHERE user_id = 12345` → needs bloom filters or clustering by user_id.
- Range scan: `WHERE order_date BETWEEN ... AND ...` → needs partition + sort by date.
- Full aggregation: `SELECT SUM(...) FROM large_table` → needs compaction + vectorized execution.
- Join: two large tables joined → needs Z-order clustering on join keys.
Step 2: Inspect Table Health Metrics
-- File count and average size (should be 100MB-500MB avg)
SELECT
COUNT(*) as file_count,
ROUND(AVG(file_size_in_bytes) / 1024.0 / 1024.0, 1) as avg_mb,
ROUND(SUM(file_size_in_bytes) / 1024.0 / 1024.0 / 1024.0, 2) as total_gb
FROM analytics.orders.files;
-- Manifest count (should be hundreds, not thousands)
SELECT COUNT(*) as manifest_count FROM analytics.orders.manifests;
-- Column statistics tightness (lower overlap across files = better data skipping)
-- Note: lower_bounds/upper_bounds are keyed by column field ID and stored as binary,
-- so the cast below is illustrative; recent Iceberg versions expose decoded
-- per-column bounds via the readable_metrics column of the files table.
SELECT
COUNT(*) as files,
ROUND(AVG(CAST(upper_bounds['customer_id'] AS BIGINT) -
CAST(lower_bounds['customer_id'] AS BIGINT)), 0) as avg_cust_range
FROM analytics.orders.files;
-- Snapshot count (how many are accumulating?)
SELECT COUNT(*) as snapshot_count FROM analytics.orders.snapshots;
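If the overall numbers look healthy but certain queries are still slow, per-partition skew is worth checking. A sketch against the partitions metadata table; the file_count and size columns assume a recent Iceberg release:
-- Per-partition file counts (look for partitions with many small files)
SELECT
  partition,
  file_count,
  ROUND(total_data_file_size_in_bytes / 1024.0 / 1024.0, 1) as total_mb
FROM analytics.orders.partitions
ORDER BY file_count DESC
LIMIT 20;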
Read Performance Tuning
1. Compaction: The #1 Read Performance Fix
Most Iceberg read performance problems trace back to too many small files:
-- Check if compaction is needed
SELECT COUNT(*) as small_files
FROM analytics.orders.files
WHERE file_size_in_bytes < 67108864; -- files smaller than 64MB
-- If small_files > 10% of total, run compaction
CALL system.rewrite_data_files(
table => 'analytics.orders',
strategy => 'binpack',
options => map(
'min-file-size-bytes', '67108864', -- merge files < 64MB
'target-file-size-bytes', '268435456', -- target 256MB output
'max-concurrent-file-group-rewrites', '5'
)
);
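On large tables a full-table rewrite can take hours; the procedure also accepts a where predicate, so compaction can be scoped to the partitions that actually need it. A sketch with an illustrative date predicate:
-- Compact only recent data (the predicate below is illustrative)
CALL system.rewrite_data_files(
  table => 'analytics.orders',
  strategy => 'binpack',
  where => 'order_date >= "2024-01-01"',
  options => map('target-file-size-bytes', '268435456')
);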
2. Partition Alignment
If queries filter on a column that’s not a partition column, pruning can’t help:
-- Current partition: months(order_date)
-- Query: WHERE customer_id = 12345 -- no partition benefit!
-- Solution A: Bloom filter on customer_id (fast to add)
ALTER TABLE analytics.orders SET TBLPROPERTIES (
'write.parquet.bloom-filter-enabled.column.customer_id' = 'true',
'write.parquet.bloom-filter-fpp.column.customer_id' = '0.05'
);
-- Then compact to generate bloom filters in new files
-- Solution B: Add a second partition level (partition evolution)
-- Note: applies only to newly written data; existing files keep the old spec until rewritten
ALTER TABLE analytics.orders
ADD PARTITION FIELD bucket(16, customer_id);
3. Z-Order Clustering for Multi-Dimensional Queries
When queries filter on multiple non-partition columns:
CALL system.rewrite_data_files(
table => 'analytics.orders',
strategy => 'sort',
sort_order => 'zorder(customer_id, product_id)'
);
-- After Z-order: queries filtering on customer_id OR product_id skip more files
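A one-off sort rewrite only reorganizes existing files; without a declared write order, new writes gradually undo the clustering. A regular multi-column sort order can be declared as table DDL (Z-order itself is only available through the rewrite procedure); a sketch assuming the Iceberg Spark SQL extensions are enabled:
-- Keep future writes clustered so the rewrite's benefit doesn't decay
ALTER TABLE analytics.orders WRITE ORDERED BY customer_id, product_id;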
4. Manifest Consolidation
Excessive manifests slow query planning:
CALL system.rewrite_manifests('analytics.orders');
-- Consolidates thousands of small manifests into fewer, larger ones
-- Reduces planning time without touching data files
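Metadata files themselves can also pile up, since every commit writes a new metadata.json. The number of retained versions can be bounded with standard table properties:
ALTER TABLE analytics.orders SET TBLPROPERTIES (
  'write.metadata.delete-after-commit.enabled' = 'true', -- drop old metadata files on commit
  'write.metadata.previous-versions-max' = '20'          -- keep the last 20 metadata.json versions
);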
5. Parquet-Level Tuning
ALTER TABLE analytics.orders SET TBLPROPERTIES (
-- Larger row groups: more data per read I/O call
'write.parquet.row-group-size-bytes' = '134217728', -- 128MB row groups
-- Smaller pages: better page-level skipping
'write.parquet.page-size-bytes' = '1048576', -- 1MB pages
-- Dictionary encoding for low-cardinality columns
'write.parquet.dict-size-bytes' = '2097152' -- 2MB dict threshold
);
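Compression codec choice also matters for scan cost; zstd typically compresses better than snappy or gzip at comparable decode speed. The level below is a starting point, not a universal setting:
ALTER TABLE analytics.orders SET TBLPROPERTIES (
  'write.parquet.compression-codec' = 'zstd',
  'write.parquet.compression-level' = '3'
);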
Write Performance Tuning
1. Right-Size Your Commit Interval (Streaming)
# Too frequent commits (every 10 seconds) → massive small file accumulation
trigger(processingTime="10 seconds") # ❌ too small
# Better: 5-minute commits balance freshness vs. file size
trigger(processingTime="5 minutes") # ✅
2. Tune Write Parallelism
# Spark: match write parallelism to table partition count
spark.conf.set("spark.sql.shuffle.partitions", "200") # adjust to cluster size
# Output data file size is an Iceberg table property, not a Spark conf:
#   write.target-file-size-bytes = 268435456   (256MB; Iceberg's default is 512MB)
3. Use Hash Distribution for Partitioned Writes
ALTER TABLE analytics.events SET TBLPROPERTIES (
'write.distribution-mode' = 'hash' -- group partition data per task
);
-- Routes each partition's rows to a single writer task, avoiding many small files per partition
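When the table also has a declared write order (see the WRITE ORDERED BY example above), range distribution is the usual companion: rows are range-shuffled on the sort key before writing, producing large, well-clustered files. This is the mode the checklist at the end refers to:
ALTER TABLE analytics.orders SET TBLPROPERTIES (
  'write.distribution-mode' = 'range' -- range-shuffle on the table's sort order before writing
);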
4. Disable Fanout for Sorted Sources
// For sources that already arrive grouped by the partition column
FlinkSink.forRowData(stream)
    .tableLoader(tableLoader)
    .overwrite(false)
    .append();
// Clustered (non-fanout) writers keep one data file open at a time and assume rows
// arrive grouped by partition; fanout writers keep a file open per partition,
// handling out-of-order data at the cost of extra writer memory.
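On the Spark side, the same trade-off is exposed as a table property; enabling it lets Spark write partition-unordered input without a pre-sort, at the cost of more open files per task:
ALTER TABLE analytics.events SET TBLPROPERTIES (
  'write.spark.fanout.enabled' = 'true' -- tolerate partition-unordered input; uses more writer memory
);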
Spark-Specific Tuning
# Critical Spark configs for Iceberg performance
spark.conf.set("spark.sql.adaptive.enabled", "true") # AQE
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Split size and target file size are Iceberg table properties / read options, not Spark confs:
#   read.split.target-size = 268435456          # 256MB read splits
#   write.target-file-size-bytes = 268435456    # 256MB target data files
# Split size can also be overridden per read:
df = spark.read.option("split-size", "268435456").table("analytics.orders")
# Vectorized Parquet reading (Iceberg also ships its own vectorized reader,
# controlled by the table property read.parquet.vectorization.enabled)
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", "4096")
Dremio-Specific Performance
Dremio’s Intelligent Query Engine provides several Iceberg-specific performance features:
- Reflections: Pre-computed, stored materializations of frequently-queried Iceberg views, substituted automatically by the planner; typically sub-second response times for dashboards.
- Automatic file pruning: Dremio's planner uses Iceberg manifest and column statistics to prune data files at planning time.
- Arrow Flight delivery: Results returned via Arrow Flight for near-zero deserialization overhead.
- Automatic table optimization: Dremio Cloud auto-compacts Iceberg tables without manual scheduling.
Performance Tuning Checklist
| Issue | Diagnosis | Fix |
|---|---|---|
| Slow range queries | Many small files | Compaction + sort order |
| Slow point lookups | No bloom filters | Enable bloom filters + compact |
| Slow query planning | Thousands of manifests | rewrite_manifests |
| Full table scans | Wrong partition scheme | Partition evolution |
| Slow writes | Small commit intervals | Increase batch size / interval |
| High storage cost | Many old snapshots | expire_snapshots |
| Uneven file sizes | No sort order / bad distribution | write.distribution-mode = range |
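The expire_snapshots fix in the table is the only one not shown earlier. A minimal sketch of the procedure; the cutoff timestamp is illustrative and should reflect your time-travel retention requirements:
CALL system.expire_snapshots(
  table => 'analytics.orders',
  older_than => TIMESTAMP '2024-01-01 00:00:00', -- illustrative cutoff
  retain_last => 10                              -- always keep the 10 most recent snapshots
);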