Understand how modern open table formats like Apache Iceberg solve traditional data lake problems effectively and efficiently
In the early days of the internet, businesses experienced a transformative shift in the way they operated, primarily due to the newfound ability to connect with a global audience. The internet facilitated the discovery of demand and supply on a previously unimaginable scale.
One of the most significant contributions of the internet to business was the ability to capture transaction data. E-commerce platforms emerged, allowing businesses to conduct transactions online and gather valuable data about customer preferences, buying patterns, and market trends.
As the internet continued to evolve, we saw an explosion in the capture of behavioral data. Websites began collecting data on user interactions, preferences, and online behaviors. This data became instrumental in shaping personalized user experiences, targeted marketing strategies, and more efficient business operations.
With the proliferation of data, businesses faced the challenge of handling and making sense of vast amounts of structured and unstructured information. This marked the inception of the big data era.
Companies started to recognize the value of not only transaction and behavioral data but also diverse data sources such as social media, sensor data, and log files. The increasing availability of powerful computing resources allowed businesses to process and analyze this data at unprecedented speeds.
The concept of a “data lake” emerged as a solution to store and manage the massive volumes and varieties of data generated by businesses. Unlike traditional databases, data lakes provided a flexible and scalable architecture to store both structured and unstructured data.
This allowed businesses to break down data silos and create a centralized repository for all their data, making it easier to derive insights and make informed decisions.
The combination of big data and data lakes paved the way for advanced analytics, machine learning, and artificial intelligence applications in business.
As data sizes kept increasing, blob storage became the de facto storage medium, and queries were run directly on top of blob stores like S3.
Early storage formats and processing engines needed to load the files and process the data according to the query’s clauses. This posed multiple challenges: slow execution, full scans over huge datasets, and large memory and compute requirements. On top of that, because blob stores don’t directly support ACID transactions, developers had to invent other ways to handle updates.
Open Table Formats
The primary goal of open table formats is to provide a flexible and efficient means of organizing, storing, and retrieving structured data in a data lake. Unlike the raw schema-on-read approach of a plain data lake, these formats track a table’s schema, partitioning, and file-level metadata explicitly, so query engines can treat a collection of files in blob storage like a database table even as the data and its schema evolve.
Three main open table formats are evolving in the ecosystem: Apache Iceberg, Apache Hudi, and Delta Lake. We will primarily focus on Iceberg in this article.
Technical overview of Apache Iceberg
Let’s understand how Apache Iceberg solves common problems such as huge data sizes, query performance, schema evolution, and ACID transactions.
Scalability
Iceberg uses a concept called “manifest files” that store metadata about the files in a table. This metadata includes file locations, sizes, and other details. By organizing file-level metadata into these manifest files, Iceberg can efficiently manage and query large datasets without needing to scan the entire dataset for every operation. This architecture allows for scalability, as it’s easier to locate and access specific data through these manifests.
Hierarchy Relationship
- The actual data files contain the raw data.
- Manifest files store metadata about these data files, including their locations, sizes, and other attributes.
- The manifest list organizes and references multiple manifest files, creating a historical view or timeline of changes to the dataset.
This hierarchy and organization allow Iceberg to efficiently manage metadata separately from the actual data. The separation of metadata and data files allows Iceberg to track changes, support schema evolution, and provide consistent access to data while efficiently managing large-scale datasets. The manifest list acts as a timeline, tracking these changes, while individual manifest files serve as snapshots of the dataset at specific points in time.
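To make this hierarchy concrete, here is a minimal sketch of inspecting that metadata with PySpark. It assumes a Spark session with the Iceberg runtime on the classpath; the catalog name demo, the warehouse path, and the db.events table are illustrative choices, not anything prescribed by Iceberg.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg runtime jar is available (e.g. via spark-submit --packages).
# The catalog name "demo" and the local warehouse path are illustrative only.
spark = (
    SparkSession.builder
    .appName("iceberg-metadata-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id INT, name STRING) USING iceberg")

# Each level of the hierarchy is exposed as a queryable metadata table.
spark.sql("SELECT snapshot_id, manifest_list FROM demo.db.events.snapshots").show(truncate=False)
spark.sql("SELECT path, added_data_files_count FROM demo.db.events.manifests").show(truncate=False)
spark.sql("SELECT file_path, record_count FROM demo.db.events.files").show(truncate=False)
```

The later sketches in this article reuse this spark session and the demo.db.events table.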
When data is inserted, updated, or deleted in Apache Iceberg, the metadata, manifest files, and manifest list are updated to reflect these changes. Here’s how it typically happens:
Data Insertion
- New Data Files: When new data is inserted, Iceberg writes the new data files to the storage system (e.g., HDFS, cloud storage).
- New Manifest File: Iceberg generates a new manifest file that includes metadata about these newly inserted data files. This manifest file contains information about the file locations, sizes, partition information, and other relevant metadata.
- Manifest List Update: Iceberg appends this new manifest file to the manifest list, reflecting the changes and creating a new snapshot of the table’s state.
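A quick sketch of this flow, reusing the spark session and the illustrative demo.db.events table from the earlier sketch: every insert commits a new snapshot, visible in the snapshots metadata table.

```python
# Reuses the `spark` session and demo.db.events table defined in the earlier sketch.
spark.sql("INSERT INTO demo.db.events VALUES (1, 'signup'), (2, 'login')")

# The commit added data files, a new manifest, and a new snapshot (operation = 'append').
spark.sql("""
    SELECT committed_at, snapshot_id, operation
    FROM demo.db.events.snapshots
    ORDER BY committed_at
""").show(truncate=False)
```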
Data Update
- Update Operation: Iceberg handles updates as a combination of a delete followed by an insert. When data is updated, the corresponding original data file is marked for deletion.
- New Data Files & Manifest: The updated data is written as new data files, and a new manifest file is generated to reflect these changes.
- Manifest List Update: This new manifest file is added to the manifest list, indicating the changes made to the dataset.
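As a sketch of what this looks like to a user (continuing the same illustrative table), an UPDATE issued through Spark SQL with the Iceberg extensions performs exactly this rewrite under the default copy-on-write mode; a merge-on-read table would record delete files instead.

```python
# Under copy-on-write, the affected data files are rewritten and a new snapshot
# is committed; the old files stay referenced only by older snapshots.
spark.sql("UPDATE demo.db.events SET name = 'sign_up' WHERE id = 1")

# The snapshot summary records how many data files were added and removed.
spark.sql("""
    SELECT snapshot_id, operation,
           summary['added-data-files']   AS added_files,
           summary['deleted-data-files'] AS deleted_files
    FROM demo.db.events.snapshots
    ORDER BY committed_at
""").show(truncate=False)
```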
Data Deletion
- Deletion Operation: When data is deleted, Iceberg marks the specific data file for deletion.
- Deletion Marker: A deletion marker is added to the manifest file to indicate that specific data files are no longer considered part of the active dataset.
- Manifest List Update: The updated manifest file with the deletion marker is included in the manifest list, signaling the removal of specific data files from the table’s state.
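And the corresponding sketch for a delete on the same illustrative table:

```python
# Commits a new snapshot in which the affected files (or rows) are no longer
# referenced; older snapshots still see them until they are expired.
spark.sql("DELETE FROM demo.db.events WHERE id = 2")

spark.sql("""
    SELECT snapshot_id, operation, summary['deleted-data-files'] AS deleted_files
    FROM demo.db.events.snapshots
    ORDER BY committed_at
""").show(truncate=False)
```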
The sequence of these actions ensures that the metadata and manifest files accurately reflect the changes in the dataset. Iceberg maintains a historical log of these changes by adding new manifest files to the manifest list, creating a timeline of the table’s evolution. This mechanism allows users to access the data at different points in time and provides a consistent view of the dataset despite ongoing insertions, updates, and deletions.
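Because every commit is preserved as a snapshot, earlier states of the table remain queryable. A minimal time-travel sketch follows; the SQL syntax assumes a recent Spark/Iceberg combination, and the snapshot id and timestamp are placeholders.

```python
# List the snapshot ids that the table can be read back to.
spark.sql("SELECT made_current_at, snapshot_id FROM demo.db.events.history").show(truncate=False)

# Read the table as of a specific snapshot id or wall-clock time (placeholder values).
spark.sql("SELECT * FROM demo.db.events VERSION AS OF 1234567890123456789").show()
spark.sql("SELECT * FROM demo.db.events TIMESTAMP AS OF '2024-01-01 00:00:00'").show()
```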
Performance
Iceberg optimizes query performance in several ways. We discussed the manifest list and its role while mutating data, but it helps while reading data too. Iceberg leverages statistics and metadata pruning to skip irrelevant data files when executing queries. This way, it only accesses the necessary files, significantly improving performance. Iceberg also uses partitioning strategies to reduce the amount of data that needs to be scanned during query execution.
Partitioning: Iceberg supports table partitioning, which involves organizing data based on specific keys or attributes. Partitioning can significantly impact query performance, as the sketch after this list illustrates:
- Partition Pruning: Iceberg uses partitioning information to skip irrelevant partitions during query execution. This optimization technique, known as partition pruning, reduces the amount of data that needs to be scanned. By excluding partitions that are not relevant to the query conditions, Iceberg significantly improves query performance.
- Data Organization: Partitioning also allows data to be organized in a way that aligns with common query patterns, improving query performance by reducing the volume of data that needs to be processed.
- Predicate Pushdown: Iceberg’s partitioning strategy enables predicate pushdown, where conditions specified in the query are pushed down to the storage layer, filtering out irrelevant data early in the query process. This minimizes the amount of data Iceberg needs to access and process during query execution.
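Here is the sketch referenced above: a partitioned table and a query that benefits from pruning, continuing the illustrative demo catalog. Iceberg’s hidden partitioning (the days(ts) transform) means readers filter on the ordinary ts column and Iceberg maps the predicate to partitions for them.

```python
# Hidden partitioning: days(ts) partitions by day without a separate partition column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.clicks (
        user_id BIGINT,
        url     STRING,
        ts      TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# The filter on ts is pushed down and used for partition pruning, so only the
# manifests and data files for 2024-03-01 are scanned.
spark.sql("""
    SELECT url, COUNT(*) AS views
    FROM demo.db.clicks
    WHERE ts >= TIMESTAMP '2024-03-01 00:00:00'
      AND ts <  TIMESTAMP '2024-03-02 00:00:00'
    GROUP BY url
""").show()
```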
Overall, the hierarchical metadata structure in Iceberg and the effective use of partitioning techniques such as partition pruning and predicate pushdown significantly contribute to enhanced query performance. These strategies reduce the amount of data scanned and processed during queries, leading to faster and more efficient data retrieval and analysis.
ACID Support
Iceberg employs a combination of concurrency controls, snapshot isolation, transaction support, and metadata management, which ensures that even in scenarios with concurrent read and write operations, the data remains consistent and the integrity of the dataset is maintained, meeting ACID compliance standards.
- Concurrency Controls: Iceberg employs concurrency control mechanisms to handle simultaneous read and write operations. This involves managing access to shared resources (data files and metadata) to ensure that multiple transactions can execute concurrently without compromising consistency.
- Snapshot Isolation: Snapshot isolation is a key technique used by Iceberg to maintain data consistency in concurrent read and write scenarios. It ensures that each transaction operates on a consistent snapshot of the data. Here’s how it works:
  - Read Consistency: When a transaction reads data, Iceberg ensures that the transaction sees a consistent snapshot of the data as of the beginning of the transaction. This prevents the transaction from being affected by changes made by other concurrent transactions.
  - Write Consistency: Iceberg ensures that write operations by a transaction are isolated from other transactions until the transaction is committed. This isolation prevents other transactions from seeing the changes made by an uncommitted transaction, maintaining a consistent snapshot of the data for other transactions.
- Transaction Support: Iceberg supports transactional operations, enabling multiple operations to be executed as a single unit of work. This includes multiple read and write operations that are either committed together or rolled back if any part of the transaction fails. Iceberg maintains the integrity of these transactions by tracking changes and ensuring that the changes are only applied if the entire transaction is successful.
- Metadata Locks and Versioning: Iceberg uses metadata locks to control access to metadata during concurrent operations. It employs versioning mechanisms to handle schema and metadata changes, ensuring that multiple operations don’t interfere with each other and that the metadata remains consistent and valid across transactions.
These controls allow for multiple transactions to execute concurrently without compromising the reliability and consistency of the data.
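One place this surfaces directly to users is table configuration: the row-level write isolation level and commit retry behavior are exposed as table properties. A minimal sketch follows; the property names mirror the Iceberg configuration documentation, but verify them against the version you run.

```python
# Validation level for concurrent row-level writes ('serializable' or 'snapshot')
# and how many times a commit is retried after losing a race to another writer.
# Property names per the Iceberg configuration docs; defaults vary by version.
spark.sql("""
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'write.update.isolation-level' = 'serializable',
        'write.delete.isolation-level' = 'serializable',
        'write.merge.isolation-level'  = 'serializable',
        'commit.retry.num-retries'     = '4'
    )
""")
```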
Schema Evolution
This one is huge, as any developer knows: schemas keep evolving with business changes and product iterations. Iceberg supports schema evolution and accommodates changes in the structure of the data without interrupting ongoing reads or writes. It allows alterations to the schema, such as adding new columns or modifying existing ones, while ensuring that both old and new data can coexist and be queried effectively. Let’s learn a bit more about the technical aspects of Iceberg’s schema evolution:
- Time-Traveling Metadata: Iceberg maintains historical metadata, allowing it to track the changes made to the schema over time. Each manifest file contains metadata about the schema as it was when the data files were written. This time-traveling metadata captures schema changes, making it possible to query the data as it was at specific points in time.
- Metadata Evolution: When a schema change occurs (e.g., adding a column), Iceberg updates its metadata to reflect these changes. The new schema is recorded separately in the metadata, while the historical metadata preserves the schema at the time the data was written. This allows Iceberg to understand and interpret both old and new schemas, enabling queries to access and make sense of the data according to the schema relevant at the time the data was ingested.
- Projection and Read Adaptability: Iceberg handles schema evolution by adapting reads to the available schema at the time the data was written. It projects the schema of the data files to match the schema specified in the query, filling in missing columns with default values or nulls. This adaptation ensures that both old and new data can be queried without data loss or errors due to schema changes.
- Compatibility Checks: Iceberg performs compatibility checks to ensure that the new schema changes are compatible with the existing data. It validates that the new schema is backward-compatible, allowing existing data to be read with the new schema without conflicts. If there are issues with backward compatibility, Iceberg may reject the schema change or suggest a migration process to handle the existing data effectively.
- Evolution Operations: Iceberg provides commands and APIs to execute schema evolution operations like adding columns, changing types, and renaming columns; a short sketch follows this list. These operations ensure that new data adheres to the updated schema while allowing existing data to be seamlessly queried using the original schema.
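Here is the sketch referenced above: for illustration, the common evolution operations map to plain DDL through Spark SQL with the Iceberg extensions, again using the illustrative demo.db.events table. The allowed type promotions (for example int to bigint) are defined by the Iceberg spec.

```python
# Add a column: existing rows simply read back NULL for it; no data files are rewritten.
spark.sql("ALTER TABLE demo.db.events ADD COLUMN country STRING")

# Rename a column: columns are tracked by field id, so old data files stay readable.
spark.sql("ALTER TABLE demo.db.events RENAME COLUMN name TO event_name")

# Widen a column type: int -> bigint is one of the promotions Iceberg allows in place.
spark.sql("ALTER TABLE demo.db.events ALTER COLUMN id TYPE BIGINT")
```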
These technical approaches collectively enable Iceberg to provide a robust solution for managing and querying large-scale, evolving datasets with improved performance, scalability, and data consistency.
Community Support
Community support and documentation are very important to the success of any project. Even the best projects fail to gain adoption and die out without a helpful community and good documentation.
Iceberg, a top-level open-source project at the Apache Software Foundation, has a growing ecosystem and support for developer tools that facilitate its adoption and usage within the broader data processing landscape.
It provides a set of APIs and libraries that enable developers to interact with and manage Iceberg tables. Libraries and SDKs have been developed to simplify integration with different programming languages and frameworks, contributing to its ease of use.
It seamlessly integrates with various popular data processing frameworks like Apache Spark, Apache Flink, Apache Beam, and more. This integration ensures that developers can leverage Iceberg’s features within the workflow of these established tools, making it easier to adopt Iceberg without major modifications to existing systems.
You can refer to their excellent documentation and the many tutorials and case studies floating around on YouTube. One I specifically liked is this one from the Apple team.