Apache DataFusion
Apache DataFusion is a fast, extensible SQL query engine written in Rust. It is designed to serve as the SQL parsing, planning, and execution layer for custom databases, analytics platforms, and data pipelines. DataFusion uses Apache Arrow as its in-memory columnar format, which enables vectorized execution and efficient data exchange, for example over Arrow Flight, Arrow's gRPC-based transfer protocol.
Iceberg Integration Scope
DataFusion integrates with Apache Iceberg natively, primarily through the official iceberg-datafusion crate (part of the iceberg-rust project). This integration implements DataFusion's core trait interfaces:
- Table Provider Implementation: By implementing the TableProvider, SchemaProvider, and CatalogProvider traits, the crate allows DataFusion to recognize Iceberg tables as standard SQL tables.
- SQL DDL Support: DataFusion can execute SQL Data Definition Language (DDL) commands, such as CREATE TABLE and DROP TABLE, directly on Iceberg catalogs.
- Updates and Appends: The engine supports INSERT INTO DML operations, writing new records as compliant Parquet files and generating the corresponding metadata commits. For partitioned tables, the engine automatically handles sort-based clustering during writes to prevent the small file problem.
- Query Optimizations: The integration supports filter predicate pushdowns (on types like Booleans, strings, and Timestamps) and limit pushdowns, allowing DataFusion to skip reading irrelevant files by analyzing Iceberg manifest statistics.
- Transactional Reading: The engine resolves position and equality delete files, applying them dynamically during execution to support Merge-on-Read (MoR) tables.
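The file skipping mentioned under Query Optimizations can be pictured with a small, standard-library-only sketch. It is not the iceberg-datafusion implementation: DataFileStats, TsRange, and prune_files are hypothetical names, and the single timestamp column stands in for the per-column lower and upper bounds that Iceberg manifests record for each data file.

```rust
/// Simplified stand-in for the per-file column bounds kept in an Iceberg
/// manifest (hypothetical type, one i64 timestamp column only).
#[derive(Debug, Clone, PartialEq)]
struct DataFileStats {
    path: String,
    min_ts: i64,
    max_ts: i64,
}

/// A pushed-down predicate of the form `ts >= lo AND ts <= hi` (hypothetical).
struct TsRange {
    lo: i64,
    hi: i64,
}

/// Keep only the files whose [min_ts, max_ts] interval overlaps the predicate
/// range. Files that cannot contain a matching row are never opened.
fn prune_files<'a>(files: &'a [DataFileStats], pred: &TsRange) -> Vec<&'a DataFileStats> {
    files
        .iter()
        .filter(|f| f.max_ts >= pred.lo && f.min_ts <= pred.hi)
        .collect()
}
```

Pruning of this kind is conservative: a file that survives may still contain no matching rows, so the filter is evaluated again on the rows that are actually read.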
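The Merge-on-Read behavior in the last bullet can be sketched in the same spirit. The example below is a conceptual illustration using only the Rust standard library, with hypothetical names (Row, apply_deletes). It assumes position deletes are row ordinals within a single data file and equality deletes match on one id column, and it omits the sequence-number checks that decide which delete files apply to which data files.

```rust
use std::collections::HashSet;

/// One row of a single data file: its ordinal position in that file,
/// a key column, and a payload (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Row {
    pos: u64,
    id: i64,
    value: String,
}

/// Filter rows at scan time instead of rewriting the data file:
/// - position deletes remove rows by their ordinal in the file,
/// - equality deletes remove rows whose `id` matches a deleted value.
fn apply_deletes(
    rows: Vec<Row>,
    position_deletes: &HashSet<u64>,
    equality_deletes: &HashSet<i64>,
) -> Vec<Row> {
    rows.into_iter()
        .filter(|r| !position_deletes.contains(&r.pos) && !equality_deletes.contains(&r.id))
        .collect()
}
```

Because the data file itself is never modified, deletes are cheap to write, and the cost of reconciling them is paid by each read until the table is compacted.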