Skip to content

Databases Deconstructed - The Value of Data Lakehouses and Table Formats

Published: at 09:00 AM

Databases and data warehouses are powerful systems that simplify working with data by abstracting many of the inherent challenges, including:

The drawback of having such tightly coupled systems is that the data within them is only understood by that specific system. Therefore, if another system is needed for a particular use case, the data must be migrated and duplicated into that other system. While this is often manageable for transactions (e.g., adding a user, updating a user, recording a sale) as a single database system can handle all CRUD operations (Create, Read, Update, Delete), it becomes more problematic in analytics. Analytical use cases are far more diverse, as are the tools required to support them, making data migration and duplication cumbersome and inefficient.

Enter the Lakehouse

In analytics, the status quo has been to duplicate your data across multiple systems for different use cases:

In an ideal world, you wouldn’t need the costs and complexity of duplicating your data across systems and figuring out how to keep it all consistent. This is where the data lakehouse pattern comes in. The data lakehouse is an architectural pattern that essentially builds a deconstructed database using your data lake as the storage layer. The benefit is that structured data can now exist once in your data lake, and both data lakehouse tools and data warehouse tools can access it.

Let’s examine the construction of a data lakehouse layer by layer.

The Storage Layer

The basic foundation of a data lakehouse is the storage layer, where we need to determine where and how to store the data. For the “where,” the obvious choice is object storage.

What is Object Storage?

Object storage is a data storage architecture that manages data as objects, as opposed to file systems that manage data as a file hierarchy, or block storage which manages data as blocks within sectors and tracks. Each object includes the data itself, a variable amount of metadata, and a unique identifier. This approach is highly scalable, cost-effective, and suitable for handling large amounts of unstructured data.

Benefits of Object Storage:

We can opt for major cloud vendors like AWS, Azure, and Google Cloud. However, many other storage vendors, such as NetApp, Vast Data, MinIO, and Pure Storage, provide additional value in object storage solutions both in the cloud and on-premises.

Next, we need to determine how we will store the data on the storage layer. The industry standard for this is Apache Parquet files.

What are Apache Parquet Files?

Apache Parquet is a columnar storage file format optimized for use with big data processing frameworks. It is designed for efficient data storage and retrieval, making it ideal for analytic workloads.

Why Parquet Files are Good for Analytics:

How Row Groups Work in Parquet:

Parquet files are divided into row groups, subsets of the data that can be processed independently. Each row group contains column chunks, each of which consists of pages. This structure enables efficient reads by allowing queries to skip irrelevant data and read only the necessary columns and rows.

+-----------------------------------------------------+
|                     Parquet File                    |
+-----------------------------------------------------+
|                    File Metadata                    |
|                                                     |
| - Schema                                            |
| - Key-Value Metadata                                |
| - Version                                           |
+-----------------------------------------------------+
|                    Row Group 1                      |
|  +-----------------------------------------------+  |
|  |                 Column Chunk 1                |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  Page 1                 |  |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  Page 2                 |  |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  ...                    |  |  |
|  |  +-----------------------------------------+  |  |
|  +-----------------------------------------------+  |
|  +-----------------------------------------------+  |
|  |                 Column Chunk 2                |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  Page 1                 |  |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  Page 2                 |  |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  ...                    |  |  |
|  |  +-----------------------------------------+  |  |
|  +-----------------------------------------------+  |
|  |                     ...                       |  |
|  +-----------------------------------------------+  |
+-----------------------------------------------------+
|                    Row Group 2                      |
|  +-----------------------------------------------+  |
|  |                 Column Chunk 1                |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  Page 1                 |  |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  Page 2                 |  |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  ...                    |  |  |
|  |  +-----------------------------------------+  |  |
|  +-----------------------------------------------+  |
|  +-----------------------------------------------+  |
|  |                 Column Chunk 2                |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  Page 1                 |  |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  Page 2                 |  |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  ...                    |  |  |
|  |  +-----------------------------------------+  |  |
|  +-----------------------------------------------+  |
|  |                     ...                       |  |
|  +-----------------------------------------------+  |
+-----------------------------------------------------+
|                     ...                             |
+-----------------------------------------------------+

By establishing a robust storage layer with object storage and using Apache Parquet files for data storage, we create a strong foundation for our data lakehouse. This setup ensures scalability, efficiency, and compatibility, essential for handling diverse and extensive data analytics workloads.

The Table Format

While Parquet files are excellent for storing data for quick access, datasets can eventually grow large enough to span multiple files. Parquet files are only aware of themselves and are unaware of other files in the same dataset. This puts the analyst responsible for defining the dataset, which can lead to mistakes where a file isn’t included or extra files are included, resulting in inconsistent data definitions across use cases. Additionally, engines still need to open every file to execute a query, which can be time-consuming, especially if many files aren’t needed for the specific query.

In this case, we need an abstraction that helps do a few things:

This abstraction allows for faster scanning of large datasets and consistent results. It is known as a “table format,” a standard for how metadata is written to document the files in the table along with their statistics.

Currently, there are three main table formats: Apache Iceberg, Apache Hudi, and Delta Lake.

Apache Iceberg is thought to have recently established itself as the industry standard.

The Catalog

Now that we have folders with metadata and data that comprise a table, processing tools need a way to know these tables exist and where the metadata for each table can be found. This is where the lakehouse catalog comes into play. A lakehouse catalog can perform several functions:

The catalog becomes the mechanism for bundling your tables and making them accessible to your preferred data processing tools. Currently, there are four main lakehouse catalogs offering solutions for this layer:

You can efficiently manage and access your data lakehouse tables by leveraging these catalogs, ensuring seamless integration with various data processing tools.

Data Processing

Now that we have everything needed to store and track our data, we just need a tool to access that data and run queries and transformations for us. One such tool is Dremio, a data lakehouse platform created to make working with data lakehouses easier, faster, and more open.

Dremio provides:

Dremio becomes a powerful tool for unifying and organizing your data into data products for your users. Since your data is in a lakehouse, several tools can be part of the picture. For example, you can use Upsolver, Fivetran, or Airbyte to ingest data into your lakehouse, run graph queries on your lakehouse with Puppygraph, and explore many other possibilities—all without needing multiple copies of your data.

Conclusion

We’ve deconstructed the traditional database and data warehouse systems to highlight the value of data lakehouses and table formats. Databases and data warehouses simplify working with data by abstracting many inherent challenges, such as storage, table management, cataloging, and query processing. However, the tightly coupled nature of these systems often necessitates data duplication across different systems, leading to increased complexity and inefficiency.

The data lakehouse architecture emerges as a solution, offering the best of both data lakes and data warehouses. By leveraging object storage and Apache Parquet files, we establish a robust storage layer that ensures scalability, efficiency, and compatibility. The introduction of table formats like Apache Iceberg, Apache Hudi, and Delta Lake further enhances our ability to manage large datasets effectively.

To manage and track our data, lakehouse catalogs like Nessie, Polaris, Unity OSS, and Gravitino provide essential functionalities such as tracking tables, providing metadata references, and governing access. Finally, tools like Dremio offer potent data processing capabilities, enabling unified analytics, efficient query execution, and seamless lakehouse management.

By adopting a data lakehouse architecture, we can streamline data management, reduce costs, and accelerate time to insight, all while maintaining the flexibility to integrate various tools and technologies.

To learn more about the practical implementation of these concepts, be sure to check out my Apache Iceberg Crash Course and get a free copy of “Apache Iceberg: The Definitive Guide”:

GET HANDS-ON

Below are list of exercises to help you get hands-on with Apache Iceberg to see all of this in action yourself!