Operations & Optimization · Last updated: May 14, 2026

Iceberg Cost Optimization

Cost optimization for Apache Iceberg lakehouses targets storage costs (snapshot expiration, compression, tiering), compute costs (compaction efficiency, query pruning), and catalog costs (managed service selection) to minimize total cost of ownership while maintaining performance.


A well-designed Apache Iceberg lakehouse can be dramatically more cost-efficient than traditional data warehouses — but poorly maintained Iceberg tables can accumulate significant unnecessary storage and compute costs. This guide covers the key cost optimization strategies across storage, compute, and catalog.

Storage Cost Optimization

1. Snapshot Expiration (Most Impactful)

Old snapshots retain references to data files that are no longer part of the current table state; those files cannot be garbage collected until the referencing snapshots are expired. Long retention windows are the #1 source of runaway Iceberg storage costs.

-- Expire snapshots older than 7 days, keep at least 10
-- (older_than is shown as an Airflow-templated value; use a literal timestamp when running manually)
CALL system.expire_snapshots(
    table => 'analytics.orders',
    older_than => TIMESTAMP '{{ macros.ds_add(ds, -7) }} 00:00:00',
    retain_last => 10
);

Size retention windows appropriately by protecting the snapshots you must keep, then expiring the rest aggressively:

-- Tag a snapshot for long-term ML reproducibility BEFORE expiring old snapshots
ALTER TABLE analytics.user_features CREATE TAG `ml_training_2026_q1`
AS OF VERSION 8027658604211071520;

-- Now expire aggressively; the tag protects this snapshot
CALL system.expire_snapshots(
    table => 'analytics.user_features',
    older_than => TIMESTAMP '2026-04-01 00:00:00',
    retain_last => 5
);
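Before expiring aggressively, it helps to confirm which snapshots are actually protected. Iceberg exposes branches and tags through the refs metadata table; a quick check (assuming Spark SQL):

-- List branches and tags and the snapshot each one pins
SELECT name, type, snapshot_id
FROM analytics.user_features.refs;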

2. Orphan File Cleanup

Orphan files (data files not referenced by any active snapshot) accumulate from failed writes, aborted commits, and table schema changes. They add pure storage cost with zero query benefit.

-- Run periodically (weekly) with a 72-hour safety buffer
CALL system.remove_orphan_files(
    table => 'analytics.orders',
    older_than => TIMESTAMP '{{ macros.ds_add(ds, -3) }} 00:00:00'
);
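To see what a cleanup would reclaim before deleting anything, the same procedure can be run in dry-run mode; a sketch (the literal timestamp is illustrative):

-- Dry run: list orphan files that would be removed, without deleting them
CALL system.remove_orphan_files(
    table => 'analytics.orders',
    older_than => TIMESTAMP '2026-05-11 00:00:00',
    dry_run => true
);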

3. Compression Optimization

Choosing the right compression codec reduces storage costs significantly:

Codec             Compression Ratio   Read Speed   Write Speed
zstd (level 3)    Excellent           Fast         Good
gzip              Best                Slow         Slow
snappy            Good                Very fast    Very fast
none              None (baseline)     Fastest      Fastest

For cold/archival Iceberg tables, gzip compression maximizes storage savings:

ALTER TABLE archive.historical_orders SET TBLPROPERTIES (
    'write.parquet.compression-codec' = 'gzip'
);
-- Next compaction will rewrite files with gzip, reducing size 20-40%
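New writes pick up the codec immediately, but existing files keep their old encoding until they are rewritten. To re-encode them without waiting for the next scheduled compaction, a rewrite can be triggered explicitly; a minimal sketch (the rewrite-all option forces rewriting of files that are already well-sized):

-- Rewrite existing data files so they are re-encoded with gzip
CALL system.rewrite_data_files(
    table => 'archive.historical_orders',
    options => map('rewrite-all', 'true')
);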

4. Storage Tiering

Move old Iceberg data files to cheaper storage tiers:

AWS S3 Intelligent-Tiering: Automatically moves objects between access tiers based on access patterns; Iceberg data files that go 30 consecutive days without a read are moved to the lower-cost Infrequent Access tier.

S3 Lifecycle rules for Iceberg data (the prefix below assumes the default warehouse layout, where each table keeps its data files under its own data/ directory):

{
  "Rules": [
    {
      "Filter": { "Prefix": "warehouse/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER_IR" }
      ],
      "Status": "Enabled"
    }
  ]
}

Note: Apply lifecycle rules only to data files (.parquet, .orc) under each table's data/ directory, as in the example above, never to metadata files; metadata must remain immediately accessible for query planning.

5. Column Selection (Projection Pushdown)

Ensure applications only request needed columns. Reading wide tables when only a few columns are needed wastes I/O and increases query costs:

# ✅ Good: select only needed columns
table.scan(selected_fields=("order_id", "total", "order_date")).to_arrow()

# ❌ Bad: full table scan when only 3 columns needed
table.scan().to_arrow()

Compute Cost Optimization

1. Effective Partition Pruning

If queries consistently scan the full table despite filters, partitioning is wrong or missing. Each full scan wastes compute proportional to table size.

-- Verify pruning is working (check query plan)
EXPLAIN SELECT * FROM analytics.orders WHERE order_date = '2026-05-14';
-- Look for: "FileScanTask: 3/5000 files" (good) vs "5000/5000" (bad)
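If the plan shows nearly every file being read, the partition spec probably does not match the query's filter columns. Iceberg supports in-place partition evolution, so the spec can be changed without rewriting existing data; a sketch using the orders table from above:

-- Partition newly written data by day of order_date so date filters can prune files
ALTER TABLE analytics.orders ADD PARTITION FIELD days(order_date);
-- Existing files keep the old spec; only new writes use the new one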

2. Compaction ROI

Compaction itself costs compute — but it pays back in reduced scan costs:

Without compaction: 10,000 small files → query reads 500 files × 5 MB   = 2.5 GB scanned, 500 file opens
With compaction:        50 large files → query reads   5 files × 250 MB = 1.25 GB scanned, 5 file opens

Savings: half the bytes scanned and 100× fewer file operations per query

Run compaction when the scan savings across the expected query volume exceed the compaction cost.
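One way to keep compaction spend in check is to compact only the partitions queries actually hit, with an explicit target file size; a sketch (the filter and option values are illustrative):

-- Compact only recent partitions into ~256 MB files
CALL system.rewrite_data_files(
    table => 'analytics.orders',
    where => 'order_date >= "2026-05-01"',
    options => map('target-file-size-bytes', '268435456')
);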

3. Reflections (Dremio)

Dremio Reflections pre-materialize frequently-queried Iceberg aggregations:

Without reflection: Every dashboard query scans 5TB Iceberg table → $$$
With reflection:    Dashboard queries hit pre-computed 500MB materialization → $

Reflections eliminate the compute cost of repeated heavy scans for static dashboards.

4. Right-Size Compute Clusters

For batch ETL on Iceberg (Spark on EMR/Dataproc), size the cluster to the data actually scanned rather than to the full table: enable dynamic allocation so executors scale with the workload, use spot/preemptible instances for fault-tolerant compaction and maintenance jobs, and shut clusters down between runs instead of leaving them idle.

Catalog Cost Optimization

Catalog                        Cost Model                           Optimization
AWS Glue                       Per API call ($0.01/10K requests)    Cache catalog operations, reduce ListTables frequency
S3 Tables                      Per operation + storage              Batch operations, use auto-compaction to reduce file count
Apache Polaris (self-hosted)   Infrastructure cost                  Right-size Polaris server; scale horizontally for load
Nessie                         Infrastructure cost                  Single small instance for moderate workloads

Cost Monitoring

-- Monitor storage growth (run weekly, alert on unexpected growth)
SELECT
    DATE(committed_at) as commit_date,
    SUM(CAST(summary['added-files-size'] AS BIGINT)) / 1024 / 1024 / 1024.0 as gb_added,
    SUM(CAST(summary['removed-files-size'] AS BIGINT)) / 1024 / 1024 / 1024.0 as gb_removed
FROM analytics.orders.snapshots
WHERE committed_at >= TIMESTAMP '2026-05-01 00:00:00'
GROUP BY 1
ORDER BY 1;

Alert if gb_added consistently exceeds gb_removed without corresponding business growth — it indicates orphan accumulation or insufficient snapshot expiration.
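Commit-level deltas can also be cross-checked against what the current snapshot actually references; the files metadata table reports live data size directly (a quick sketch):

-- Total size of data files referenced by the current snapshot
SELECT
    COUNT(*) AS data_files,
    SUM(file_size_in_bytes) / 1024 / 1024 / 1024.0 AS live_gb
FROM analytics.orders.files;

A large gap between live_gb and what the bucket actually holds points to unexpired snapshots or orphan files.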

📚 Go Deeper on Apache Iceberg

Alex Merced has authored three hands-on books covering Apache Iceberg, the Agentic Lakehouse, and modern data architecture. Pick up a copy to master the full ecosystem.
