by Arpit Kumar
18 Oct, 2023
9 minute read
Understanding Parquet File Format

Exploring the evolution of big data analytics and the role played by the Parquet file format in optimizing storage, query performance, and data processing. We will learn how Parquet’s columnar storage, compression options, and support for predicate pushdown have transformed the landscape of big data analytics.


Post 2005 we started hearing a lot about big data, and post 2010 it became almost necessary for every business to gather as much data as possible and run analytics on top of it to better understand its customers, products, machines and so on.

From the early Hadoop-based MapReduce jobs at Yahoo to the Hive-based storage and compute solutions coming out of Facebook, the tech world kept moving towards ease of setup, standardization, cost effectiveness and speed of processing. Tools like Apache Spark and the Presto query engine started as open-source projects and became ubiquitous across the big data ecosystem. Gradually, companies started hiring and building teams dedicated to big data analytics, data science and data storage.

As data kept growing, further optimization of storage, query planning and query execution became the crux of data lake solutions. While most of the earlier use cases were based on batch processing, the focus started shifting towards streaming solutions that can provide insights in near real time. This raised the expectations of businesses and customers to see results as soon as possible, and it made engineers think harder about how data is stored, how much data is scanned for a query, and how it is filtered and aggregated in memory.

Any analytics query has three parts - planning, scanning and execution. All three pieces of the query engine and storage layer matter for getting results at the lowest cost and latency possible.

Here we are going to focus on two of those pieces: data scanning and storage.

The most common, easiest-to-use and human-readable format is the comma-separated values (CSV) file. CSV files can be loaded into memory or into tools like Excel (if the data fits in memory). CSV/TSV is still used in most places where the data size is not that big, but we need more efficient solutions to work on larger data.

Let’s take an example: if we have stored 1 TB of data in CSV files, we will probably split it across multiple files, and when a query is made the analytics engine will load all the data from disk into memory and then filter and aggregate it in memory. This is because the query engine cannot know which data is present in which file unless it reads the file. So to return the correct result, it has to read all the files and only then produce the answer.

Obviously this is a very crude way of explaining the inefficiencies, but that’s the crux of the problem. We can reduce the data size by compressing it while storing; we can pick algorithms like zstd or snappy to shrink the data on disk. This reduces disk storage and read cost, but it still requires the same amount of memory to filter and aggregate the data after decompression.
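
As a rough sketch of that naive approach (a minimal illustration, assuming the 1 TB dataset has been split into gzipped CSV parts under a hypothetical `data/` directory with an `age` column), every file has to be read and decompressed in full before any filtering can happen:

```python
import glob
import pandas as pd

# Hypothetical layout: the dataset split into many gzipped CSV parts.
csv_parts = glob.glob("data/events_*.csv.gz")

matching = []
for part in csv_parts:
    # Every part must be read and decompressed in full: compression
    # saved disk space, but not scan time or memory.
    df = pd.read_csv(part, compression="gzip")
    # Filtering only happens after the whole chunk is in memory.
    matching.append(df[df["age"] > 30])

result = pd.concat(matching, ignore_index=True)
```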

One of the file formats that helps solve the problems we just discussed is Parquet. It is an open-source columnar storage file format designed for use with big data processing frameworks such as Apache Hadoop and Apache Spark. It emerged to address the limitations of row-based storage formats like CSV and TSV for certain types of workloads and use cases.

Emergence and Purpose

Parquet was created as a collaborative effort by Cloudera and Twitter to address the shortcomings of existing storage formats, particularly for use cases involving large-scale data processing. It was designed to provide efficient storage and retrieval of data for analytics and big data workloads. Parquet’s primary goal is to optimize performance by storing data in a way that makes it easy to read and process only the columns that are necessary for a particular query.

The creation of Parquet was motivated by several specific problems and challenges that companies faced in their big data processing and storage systems. General issues that motivated Parquet’s creation include:

  • Efficient Storage for Large Datasets: Many tech companies were dealing with vast amounts of data, and storing and managing these large datasets efficiently was a primary concern. Existing storage formats like CSV and JSON were not well-suited for these use cases, as they are row-based and lack the columnar organization that can significantly improve storage and query performance.
  • Query Performance: Traditional storage formats were not optimized for analytical query performance. When dealing with analytical workloads, it’s essential to retrieve and process data efficiently. Columnar storage formats like Parquet and ORC were developed to address this issue by enabling column-level access and compression.
  • Data Schema Evolution: Apart from the basic problems discussed above, one of the harder challenges is data schema evolution, a common issue in fast-growing companies. Parquet was designed to accommodate evolving data schemas without requiring extensive data transformation or conversion processes. This flexibility was essential in dealing with changing data structures over time.
  • Optimized Compression: Storing large volumes of data can be costly in terms of storage space. Parquet addressed this issue by supporting a variety of compression codecs that allow saving on storage costs without sacrificing query performance.

How Parquet Works

Parquet stores data in a columnar format, and it uses a combination of several techniques to improve query performance:

  • Columnar Storage: Data for each column is stored together, making it efficient to read only the necessary columns for a query (see the short pyarrow sketch after this list).
  • Compression: Parquet supports various compression algorithms, such as Snappy, Gzip, and LZO, which can be applied to individual columns. This reduces storage space and speeds up data access.
  • Predicate Pushdown: The columnar nature of Parquet allows query engines to push down predicates and apply them at the column level, which reduces the amount of data that needs to be read from disk.
  • Schema Evolution: Parquet supports schema evolution, which means you can add, remove, or change columns in a backward-compatible manner, making it easier to handle changing data structures.
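
As a minimal sketch of what a columnar read looks like in practice (the file name and column names here are illustrative, using pyarrow), a reader can project just the columns a query needs and never touch the rest of the file’s column chunks:

```python
import pyarrow.parquet as pq

# Only the "user_id" and "age" column chunks are read from disk;
# every other column in the file is skipped entirely.
table = pq.read_table("events.parquet", columns=["user_id", "age"])

print(table.num_rows, table.column_names)
```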

What is predicate pushdown?

It’s important to grasp this concept first in order to understand how Parquet helps optimize the scan and data load for a query.

Predicate pushdown is an optimization technique used in database and data processing systems to improve query performance by pushing filtering conditions (predicates) as close to the data source as possible. The idea is to apply these filtering conditions early in the data retrieval process, reducing the amount of data that needs to be processed and transferred.

In the context of databases and data processing frameworks, here’s how predicate pushdown works:

  • Query Parsing: When you submit a query, it typically includes filtering conditions (predicates) in the WHERE clause. For example, you might have a query like SELECT * FROM table WHERE age > 30.
  • Data Source: The data source can be a database, a data warehouse, or a distributed file system. It contains the actual data that the query needs to access.
  • Traditional Approach: Without predicate pushdown, the system might first retrieve all the data from the data source and then apply the filtering conditions during processing. This means that all the data, including irrelevant rows, must be transferred and processed, which can be inefficient for large datasets.
  • Predicate Pushdown: With predicate pushdown, the system optimizes the query by pushing the filtering conditions down to the data source, where they are applied as close to the data as possible. In the example query, the system might apply the age > 30 filter at the data source, and only the rows that satisfy this condition are transferred and processed (see the sketch after this list).
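
As a small illustration of pushdown against Parquet (a sketch assuming a hypothetical events.parquet file with an age column, and a reasonably recent pyarrow), the filter is handed to the scan itself rather than applied after loading everything:

```python
import pyarrow.parquet as pq

# The predicate is evaluated during the scan: row groups whose
# min/max statistics for "age" rule out any match are skipped
# instead of being read and filtered in memory afterwards.
table = pq.read_table("events.parquet", filters=[("age", ">", 30)])

print(table.num_rows)
```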

Advantages of Predicate Pushdown

  • Reduced Data Transfer: By filtering at the source, you minimize the amount of data transferred over the network, which can be a significant performance gain, especially in distributed systems.
  • Improved Query Performance: Filtering data early in the process reduces the amount of data that needs to be processed, leading to faster query execution.
  • Resource Efficiency: Fewer computing and memory resources are required because there is less data to handle.
  • Scalability: In distributed systems, predicate pushdown helps distribute the processing load more efficiently among nodes, making it easier to scale horizontally.

Predicate pushdown is a common optimization technique used in relational databases, data warehouses, and big data processing frameworks like Apache Hive, Apache Spark, and Apache Impala. It significantly improves the efficiency of data retrieval and query processing, making it an important tool for optimizing query performance in various data processing scenarios. So any file format that supports predicate pushdown is going to speed things up a lot, even if the total data stored is huge. Let’s dive deeper into the structure of the Parquet file format to understand how it organizes data inside the file.

Internal Parquet File Structure

Parquet is a self-describing format that contains both the data and the metadata in the file.


Parquet File Format
https://parquet.apache.org/docs/file-format/metadata/

Broadly, a Parquet file consists of four main components (the short sketch after this list shows how to inspect them):

  • File Metadata: This includes information about the file, such as the file schema, compression settings, and other metadata.
  • Row Groups: The data is divided into row groups, where each row group contains a subset of rows from the file. Each row group stores the data for all the columns.
  • Column Chunks: Within a row group, the data for each column is stored in a column chunk, which holds the data and metadata for that single column.
  • Data Pages: The actual data for each column is stored in data pages. Data pages can be compressed, and Parquet supports different compression codecs for data pages.
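
Because this metadata lives in the file footer, it can be inspected without touching the data pages. A minimal sketch with pyarrow (the file name is illustrative):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")

# File-level metadata read from the footer.
meta = pf.metadata
print(meta.num_rows, meta.num_row_groups)
print(pf.schema_arrow)

# Row group and column chunk metadata, including the min/max
# statistics that make predicate pushdown possible.
rg = meta.row_group(0)
col = rg.column(0)
print(col.path_in_schema, col.compression, col.statistics)
```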

If you can’t remember the complex structure, just remember this basic structure for reference:


Simplified Parquet File Format

Compression in Parquet

Parquet supports various compression codecs, and the choice of codec depends on your specific requirements (a small comparison sketch follows the list). Popular compression codecs used with Parquet include:

  • Snappy: Snappy is a fast compression algorithm that provides a good balance between compression ratio and decompression speed. It’s commonly used for Parquet files when query performance is critical.
  • Gzip: Gzip provides higher compression ratios but can be slower to decompress. It is used when minimizing storage space is more important than query speed.
  • LZO: LZO is another compression option with a good balance between compression and decompression speed. It’s often used in Hadoop-based ecosystems.
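
As a quick way to see the trade-off, you could write the same table with different codecs and compare the resulting file sizes. This is only a sketch with made-up data; zstd stands in for LZO here since pyarrow’s writer does not ship an LZO codec:

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# A small made-up table with a repetitive text column and a numeric column.
table = pa.table({
    "message": ["hello parquet"] * 100_000,
    "value": list(range(100_000)),
})

for codec in ["snappy", "gzip", "zstd"]:
    path = f"events_{codec}.parquet"
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```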

Different compression for each column

Parquet allows you to select the compression codec for each column independently, which can be useful for optimizing storage and query performance based on the data characteristics.

This ability to apply column-level compression is particularly useful in scenarios where some columns may benefit from higher compression ratios while others require faster decompression for query performance. For example, you might choose to use a high-compression codec like Gzip for a column with textual data to save on storage space, while using a faster codec like Snappy for a numerical column to enhance query speed.
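
A minimal sketch of per-column codec selection with pyarrow, assuming a hypothetical table with a textual description column and a numeric amount column:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "description": ["some long free-text field"] * 1_000,
    "amount": [i * 1.5 for i in range(1_000)],
})

# Heavier compression for the text column, a faster codec for the
# numeric column that hot queries aggregate over.
pq.write_table(
    table,
    "transactions.parquet",
    compression={"description": "gzip", "amount": "snappy"},
)
```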

Adoption Across Companies

Parquet has gained wide adoption across various industries and companies, particularly those dealing with big data analytics. Many technology giants, such as Cloudera, Facebook, Twitter, and Netflix, have contributed to its development and widely use it. Additionally, Parquet’s open-source nature has led to broad industry support and adoption.

Today most streaming and batch data processing engines have sink connectors for writing files in Parquet format. This rich ecosystem makes Parquet an easy choice for almost every big data analytics team.
