File & Metadata Layer Last updated: May 14, 2026

Iceberg Positional Deletes

Positional delete files in Apache Iceberg record the exact file path and row position of deleted rows, enabling efficient row-level deletion in merge-on-read mode without rewriting data files. They are written by engines that know the target row's position, such as Spark's merge-on-read DML and, for rows deleted within the same checkpoint, streaming writers like Apache Flink.


Iceberg Positional Delete Files

Positional delete files are one of the two types of delete files introduced in Apache Iceberg Spec v2, the other being equality delete files. They enable row-level deletion in Merge-on-Read mode by recording the exact file path and row position (row index within the file) of every deleted row, allowing the query engine to skip those rows during reads without rewriting the original data files.

Structure of a Positional Delete File

A positional delete file is written in the table's delete file format (commonly Parquet; Avro and ORC are also valid) and contains two required columns:

Column       Type      Description
file_path    string    The full URI of the data file containing the deleted row
pos          long      The 0-based position (row index) of the deleted row within the data file

Rows in a positional delete file must be sorted by file_path, then by pos, so all deletes for a given data file are contiguous.

Example content:

file_path                                           pos
s3://bucket/data/orders/part-00001.parquet          42
s3://bucket/data/orders/part-00001.parquet          10019
s3://bucket/data/orders/part-00002.parquet          7
s3://bucket/data/orders/part-00003.parquet          15887

These entries mean: “When reading part-00001.parquet, skip row 42 and row 10019. When reading part-00002.parquet, skip row 7.”
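A simple way to model these entries in memory is a map from file path to the set of deleted positions. This is only an illustrative sketch; real engines use more compact structures such as bitmaps:

```python
from collections import defaultdict

# Hypothetical in-memory index of positional delete entries:
# file_path -> set of deleted 0-based row positions
deletes = defaultdict(set)

entries = [
    ("s3://bucket/data/orders/part-00001.parquet", 42),
    ("s3://bucket/data/orders/part-00001.parquet", 10019),
    ("s3://bucket/data/orders/part-00002.parquet", 7),
    ("s3://bucket/data/orders/part-00003.parquet", 15887),
]
for file_path, pos in entries:
    deletes[file_path].add(pos)

print(sorted(deletes["s3://bucket/data/orders/part-00001.parquet"]))
# → [42, 10019]
```

Grouping by file path first is what makes the read side cheap: each data file scan only needs the positions that belong to it.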

How Query Engines Apply Positional Deletes

When reading an Iceberg table with pending positional delete files:

  1. The engine identifies which positional delete files apply to each data file being scanned, using the delete file’s partition and sequence number (a positional delete applies only to data files written no later than it) plus the file_path bounds in its manifest entry.
  2. For each data file, the engine loads the corresponding positional delete entries.
  3. As the engine scans each row group within the data file, it skips rows whose positions match a delete entry.
  4. The deleted rows are never returned to the query result.

The skip operation happens at the row level — not at the row group or file level — so the engine must still open and scan the data file; it simply omits the deleted rows.
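The per-file skip step above can be sketched as a generator that yields every row whose position is not in the delete set (a simplified model; engines apply this inside the columnar scan):

```python
def scan_with_deletes(rows, deleted_positions):
    """Yield rows from one data file, skipping deleted positions.

    rows: iterable of row values in file order (position = index).
    deleted_positions: set of 0-based positions to skip.
    """
    for pos, row in enumerate(rows):
        if pos not in deleted_positions:
            yield row

rows = ["a", "b", "c", "d", "e"]
print(list(scan_with_deletes(rows, {1, 3})))  # → ['a', 'c', 'e']
```

Note that every row is still visited: positional deletes reduce what is returned, not what is scanned.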

Positional Deletes vs. Equality Deletes

Aspect                          Positional Deletes                     Equality Deletes
What is recorded                File path + row position               Column values
How applied                     Skip specific positions                Filter rows by value match
Requires knowing row position   Yes                                    No
Read efficiency                 Very efficient                         Less efficient (join-like scan)
Typically generated by          Spark MoR DML (UPDATE/DELETE/MERGE)    Streaming CDC writers (Flink)
Use case                        Engine-planned row-level DML           High-throughput streaming deletes by key
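The difference in how the two delete types are applied can be sketched as two filters. This is an illustrative contrast only; the function names are hypothetical:

```python
# Positional delete: skip rows by index within a known data file.
def apply_positional(rows, deleted_positions):
    return [r for i, r in enumerate(rows) if i not in deleted_positions]

# Equality delete: drop any row whose key column matches a deleted value,
# regardless of which file or position holds it (a join-like scan).
def apply_equality(rows, deleted_keys, key):
    return [r for r in rows if r[key] not in deleted_keys]

rows = [{"id": 1}, {"id": 2}, {"id": 3}]
print(apply_positional(rows, {0}))      # → [{'id': 2}, {'id': 3}]
print(apply_equality(rows, {2}, "id"))  # → [{'id': 1}, {'id': 3}]
```

The positional filter touches only one file's delete set; the equality filter must be evaluated against every candidate row, which is why it costs more at read time.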

When Positional Deletes Are Generated

Positional deletes are generated by write engines that have scanned the target data files and therefore know exact row positions. In practice this means Spark's merge-on-read DELETE, UPDATE, and MERGE, and streaming writers such as Flink for rows added and then deleted within the same checkpoint.

Positional Delete File Scope

Positional delete files are scoped to specific data files via their partition and the file_path column bounds stored in their manifest entries. This allows query planning to skip positional delete files that cannot apply to the data files being scanned.

This scoping is what prevents positional delete files from becoming a global performance bottleneck.
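The pruning check can be sketched as a simple bounds comparison on the file_path column (a simplified model; the bound values shown are hypothetical):

```python
def delete_file_applies(data_file_path, lower_bound, upper_bound):
    """Check whether a positional delete file can reference a data file,
    using the file_path column bounds from its manifest entry.
    Paths compare lexicographically, like any string column's bounds."""
    return lower_bound <= data_file_path <= upper_bound

# A delete file whose file_path bounds cover part-00001 .. part-00002
lower = "s3://bucket/data/orders/part-00001.parquet"
upper = "s3://bucket/data/orders/part-00002.parquet"

print(delete_file_applies("s3://bucket/data/orders/part-00001.parquet", lower, upper))  # → True
print(delete_file_applies("s3://bucket/data/orders/part-00009.parquet", lower, upper))  # → False
```

A delete file whose bounds exclude a data file's path can be dropped from that file's scan task entirely, so its entries are never even loaded.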

Compaction: Applying Positional Deletes

Positional delete files accumulate over time and must be applied via compaction to restore full read performance:

-- Spark: rewrite data files to apply positional deletes
CALL system.rewrite_data_files(
  table => 'db.orders',
  -- rewrite every data file that has at least one pending delete file
  options => map('delete-file-threshold', '1')
);

After compaction, the rewritten data files contain no deleted rows, and the applied positional delete files are removed from the table’s metadata — eliminating their merge-on-read overhead.
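Logically, the rewrite can be sketched as materializing each data file with its deletes applied and then dropping the delete files (a simplified model; the real procedure also handles grouping, sorting, and commit semantics):

```python
def compact(data_files, deletes):
    """Sketch of what a rewrite does logically: produce new data files
    with positional deletes applied, leaving no delete files behind.

    data_files: dict of file path -> list of rows (position = index).
    deletes: dict of file path -> set of deleted positions.
    """
    new_files = {}
    for path, rows in data_files.items():
        dead = deletes.get(path, set())
        new_files[path] = [r for i, r in enumerate(rows) if i not in dead]
    return new_files, {}  # no positional delete files remain

new_files, remaining = compact({"f1": ["a", "b", "c"]}, {"f1": {1}})
print(new_files["f1"], remaining)  # → ['a', 'c'] {}
```

After this step, readers pay no merge cost: the deleted rows are physically gone from the data files.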

