
Understanding Apache Iceberg's Metadata.json

Published: at 09:00 AM

Introduction

Apache Iceberg is a table format for data lakehouses, designed to solve many of the problems of large-scale data lakes by giving them the capabilities of a data warehouse. It supports schema evolution, time travel queries, and efficient data partitioning, all while remaining compatible with existing data processing engines. Central to Iceberg’s functionality is the metadata.json file, which serves as the root of a table's metadata.

Purpose of metadata.json

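The metadata.json file in Apache Iceberg serves several critical purposes: it identifies the table and its format version, records where the table lives, and tracks the table's schemas, partition specs, snapshots, metadata history, and sort orders.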

This file is not a static record. Every commit writes a new metadata.json that replaces the previous one as the table's current metadata, so the metadata evolves with the table, making it an indispensable component of Apache Iceberg’s architecture.

Detailed Breakdown of Fields

Identification and Versioning

format-version

table-uuid

Table Structure and Location

location

last-updated-ms

Schema Management

schemas and current-schema-id

Data Partitioning

partition-specs and default-spec-id

Snapshots and History

last-sequence-number, current-snapshot-id, snapshots, snapshot-log

Metadata Logging

metadata-log

Sorting and Ordering

sort-orders and default-sort-order-id

Example Metadata.json

{
  "format-version": 2,
  "table-uuid": "5f8b14d8-0a14-4e6a-8b04-7b1b9341c939",
  "location": "s3://my-bucket/tables/my_table",
  "last-updated-ms": 1692643800000,
  "last-sequence-number": 100,
  "last-column-id": 10,
  "schemas": [
    {
      "schema-id": 1,
      "type": "struct",
      "fields": [
        {"id": 1, "name": "id", "required": false, "type": "int"},
        {"id": 2, "name": "name", "required": false, "type": "string"}
      ]
    },
    {
      "schema-id": 2,
      "type": "struct",
      "fields": [
        {"id": 1, "name": "id", "required": false, "type": "int"},
        {"id": 2, "name": "name", "required": false, "type": "string"},
        {"id": 3, "name": "age", "required": false, "type": "int"}
      ]
    }
  ],
  "current-schema-id": 2,
  "partition-specs": [
    {
      "spec-id": 1,
      "fields": [
        {"name": "name", "transform": "identity", "source-id": 2}
      ]
    },
    {
      "spec-id": 2,
      "fields": [
        {"name": "age", "transform": "bucket[4]", "source-id": 3}
      ]
    }
  ],
  "default-spec-id": 2,
  "last-partition-id": 4,
  "properties": {
    "commit.retry.num-retries": "5"
  },
  "current-snapshot-id": 3,
  "snapshots": [
    {"snapshot-id": 1, "timestamp-ms": 1692643200000},
    {"snapshot-id": 2, "timestamp-ms": 1692643500000},
    {"snapshot-id": 3, "timestamp-ms": 1692643800000}
  ],
  "snapshot-log": [
    {"timestamp-ms": 1692643200000, "snapshot-id": 1},
    {"timestamp-ms": 1692643500000, "snapshot-id": 2},
    {"timestamp-ms": 1692643800000, "snapshot-id": 3}
  ],
  "metadata-log": [
    {"timestamp-ms": 1692643200000, "metadata-file": "s3://my-bucket/tables/my_table/metadata/00001.json"},
    {"timestamp-ms": 1692643500000, "metadata-file": "s3://my-bucket/tables/my_table/metadata/00002.json"}
  ],
  "sort-orders": [
    {
      "order-id": 1,
      "fields": [
        {"transform": "identity", "source-id": 1, "direction": "asc", "null-order": "nulls-first"}
      ]
    }
  ],
  "default-sort-order-id": 1,
  "refs": {
    "main": {"snapshot-id": 3, "type": "branch"}
  },
  "statistics": [
    {
      "snapshot-id": 3,
      "statistics-path": "s3://my-bucket/tables/my_table/stats/00003.puffin",
      "file-size-in-bytes": 1024,
      "file-footer-size-in-bytes": 64,
      "blob-metadata": [
        {
          "type": "table-stats",
          "snapshot-id": 3,
          "sequence-number": 100,
          "fields": [1, 2, 3],
          "properties": {
            "statistic-type": "summary"
          }
        }
      ]
    }
  ],
  "partition-statistics": [
    {
      "snapshot-id": 3,
      "statistics-path": "s3://my-bucket/tables/my_table/partition_stats/00003.parquet",
      "file-size-in-bytes": 512
    }
  ]
}
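
To make the role of the ID pointers in the example above more concrete, here is a minimal Python sketch of how a reader might resolve the "current" schema, partition spec, and snapshot once the file has been parsed. The helper names `pick` and `resolve_current` and the trimmed-down sample are hypothetical, written for illustration only; they are not part of any Iceberg library.

```python
import json

# Hand-written, trimmed-down sample (an assumption for illustration,
# not a complete or valid table metadata file).
SAMPLE = json.loads("""
{
  "current-schema-id": 2,
  "schemas": [
    {"schema-id": 1, "fields": [{"id": 1, "name": "id"}]},
    {"schema-id": 2, "fields": [{"id": 1, "name": "id"}, {"id": 3, "name": "age"}]}
  ],
  "default-spec-id": 2,
  "partition-specs": [{"spec-id": 1}, {"spec-id": 2}],
  "current-snapshot-id": 3,
  "snapshots": [{"snapshot-id": 1}, {"snapshot-id": 2}, {"snapshot-id": 3}]
}
""")


def pick(items, key, wanted):
    """Return the first dict in items whose `key` equals `wanted`, else None."""
    return next((item for item in items if item.get(key) == wanted), None)


def resolve_current(metadata):
    """Follow the ID pointers to the active schema, partition spec, and snapshot."""
    return {
        "schema": pick(metadata.get("schemas", []), "schema-id",
                       metadata.get("current-schema-id")),
        "spec": pick(metadata.get("partition-specs", []), "spec-id",
                     metadata.get("default-spec-id")),
        "snapshot": pick(metadata.get("snapshots", []), "snapshot-id",
                         metadata.get("current-snapshot-id")),
    }


current = resolve_current(SAMPLE)
column_names = [f["name"] for f in current["schema"]["fields"]]
```

Keeping every historical schema and spec in a list, with a separate pointer to the active one, is what lets older snapshots remain readable after the table changes.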

How Engines Use metadata.json

Query Planning

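One of the primary uses of the metadata.json by data processing engines is in query planning. The engine reads the current snapshot, schema, and partition spec from the file to decide which data it needs to scan.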
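
One concrete planning step is choosing which snapshot to read, for instance for a time travel query. A minimal sketch, assuming the `snapshot-log` entries are ordered by timestamp as in the example file; `snapshot_as_of` is a hypothetical helper, not an Iceberg API:

```python
from bisect import bisect_right

# Same entries as the snapshot-log in the example file.
SNAPSHOT_LOG = [
    {"timestamp-ms": 1692643200000, "snapshot-id": 1},
    {"timestamp-ms": 1692643500000, "snapshot-id": 2},
    {"timestamp-ms": 1692643800000, "snapshot-id": 3},
]


def snapshot_as_of(snapshot_log, timestamp_ms):
    """Return the ID of the snapshot that was current at timestamp_ms.

    Returns None if the table had no snapshot yet at that time.
    Assumes snapshot_log is sorted by "timestamp-ms" in ascending order.
    """
    times = [entry["timestamp-ms"] for entry in snapshot_log]
    # Index of the first log entry strictly later than the requested time.
    idx = bisect_right(times, timestamp_ms)
    if idx == 0:
        return None
    return snapshot_log[idx - 1]["snapshot-id"]


as_of_second_commit = snapshot_as_of(SNAPSHOT_LOG, 1692643500000)
before_first_commit = snapshot_as_of(SNAPSHOT_LOG, 1692643199999)
```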

Schema Evolution
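
Iceberg tracks columns by field ID rather than by name, which is what lets an engine read files written under an older schema. A minimal sketch of that idea, assuming (as a simplification) that a row from a data file is available keyed by field ID; `project_row` and the sample values are hypothetical:

```python
def project_row(row_by_field_id, current_fields):
    """Build a row for the current schema from values keyed by field ID.

    Columns that did not exist when the file was written come back as None.
    """
    return {f["name"]: row_by_field_id.get(f["id"]) for f in current_fields}


# Fields of the current schema (schema-id 2 in the example file).
CURRENT_FIELDS = [
    {"id": 1, "name": "id"},
    {"id": 2, "name": "name"},
    {"id": 3, "name": "age"},  # added after the old file was written
]

# A row from a file written under schema-id 1, which had no "age" column.
old_row = {1: 42, 2: "Ada"}
row = project_row(old_row, CURRENT_FIELDS)
```

Because the lookup goes through the ID, renaming a column in the current schema would not change which stored values it maps to.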

Data Consistency

Data Layout and Sorting

Metadata Updates

Optimistic Concurrency Controls
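
The idea of an optimistic commit can be sketched as a compare-and-swap on the pointer to the current metadata file. This is a toy in-memory model under stated assumptions, not how any particular catalog is implemented; `ToyCatalog`, `CommitConflict`, `commit`, and the file names are all hypothetical.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table changed after the writer read it."""


class ToyCatalog:
    """Holds a single pointer to the table's current metadata file."""

    def __init__(self, metadata_location):
        self._lock = threading.Lock()
        self.metadata_location = metadata_location

    def commit(self, expected, new):
        """Atomically swap the pointer, but only if it still equals `expected`."""
        with self._lock:
            if self.metadata_location != expected:
                raise CommitConflict(
                    f"expected {expected}, found {self.metadata_location}"
                )
            self.metadata_location = new


catalog = ToyCatalog("metadata/00002.json")
base = catalog.metadata_location  # both writers start from the same version

catalog.commit(base, "metadata/00003-a.json")  # writer A commits first

try:
    catalog.commit(base, "metadata/00003-b.json")  # writer B's base is stale
    conflict = False
except CommitConflict:
    conflict = True  # B would re-read the latest metadata and retry
```

A losing writer does not corrupt the table; it re-reads the latest metadata and tries again. The `commit.retry.num-retries` property in the example file is the kind of setting that bounds those retries.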

Conclusion on Engine Usage

The metadata.json in Apache Iceberg acts as a comprehensive guide for data engines, enabling them to efficiently manage, query, and evolve large-scale data tables. By providing detailed metadata, it allows for optimizations at various levels, from query planning to data consistency, making Iceberg tables highly performant and flexible.
