Apache Parquet provides several benefits for data storage and retrieval when compared to traditional methods like CSV.
Parquet format is designed for faster data processing of complex types. In this article, we talk about how the Parquet format is suitable for today’s ever-growing data needs.
Before we dig into the details of Parquet format, let us understand what CSV data is and the challenges it poses for data storage.
What is CSV storage?
We have all heard a lot about CSV (Comma Separated Values) – one of the most common ways of organizing and formatting data. CSV data storage is row-based. CSV files are stored with .csv extension. We can store and open CSV data using Excel, Google Sheets, or any text editor. The data is readily viewable once the file is opened.
Well, that’s not good – definitely not for a database format.
Further, as data volume grows, it becomes difficult to query, manage and retrieve.
Here is an example of data stored in a .csv file:
EmpId,First name,Last name,Division
2012011,Sam,Butcher,IT
2013031,Mike,Johnson,Human Resource
2010052,Bill,Matthew,Architect
2010079,Jose,Brian,IT
2012120,Adam,James,Solutions
If we view it in Excel, we see the same data laid out in a row-column structure, with one employee per row and one field per column.
Challenges with CSV storage
Row-based storage formats like CSV are suitable for Create, Update and Delete operations, because each record is kept together in one place.
What about the Read in CRUD, then?
Imagine a million rows in the above .csv file. It would take a considerable amount of time to open the file and search for the data you are looking for. Not so cool. Cost matters too: most cloud providers, like AWS, charge companies based on the amount of data scanned or stored, and CSV files consume a lot of space.
CSV storage doesn’t have an exclusive option to store metadata, making data scanning a tedious task.
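To make the two problems above concrete, here is a minimal sketch using Python's standard `csv` module. The sample text mirrors the employee example, and `read_column` is a hypothetical helper written for this illustration. It shows that fetching a single column still means parsing every row, and that, with no schema in the file, every value comes back as a string.

```python
import csv
import io

# Sample data mirroring the employee example above.
CSV_TEXT = """EmpId,First name,Last name,Division
2012011,Sam,Butcher,IT
2013031,Mike,Johnson,Human Resource
2010052,Bill,Matthew,Architect
2010079,Jose,Brian,IT
2012120,Adam,James,Solutions
"""


def read_column(csv_text, column):
    """Return one column's values; every row is still parsed in full."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column] for row in reader]


divisions = read_column(CSV_TEXT, "Division")
emp_ids = read_column(CSV_TEXT, "EmpId")

print(divisions)
# CSV carries no schema, so numeric-looking values arrive as str.
print(type(emp_ids[0]))
```

With five rows this is harmless. With a million rows, the same full parse is paid on every query, whichever column is needed.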
So, what’s the cost-effective and optimal solution for performing all the CRUD operations? Let us explore.
What is Parquet data storage?
Parquet is an open-source file format for storing data. It is widely used in the Hadoop and Spark ecosystems. Parquet files are stored with the .parquet extension.
Parquet is a highly structured format. It can also be used to optimize complex raw data present in bulk in data lakes. This can significantly reduce query time.
Parquet makes data storage efficient and retrieval faster because of a mix of row and columnar-based (hybrid) storage formats. In this format, the data is partitioned horizontally as well as vertically. Parquet format also eliminates the parsing overhead to a large extent.
The format reduces the overall number of I/O operations and, ultimately, the cost.
Parquet also stores metadata, which holds information about the data such as the schema, number of values, location of columns, min value, max value, number of row groups, type of encoding, etc. The metadata is stored at different levels in the file, making data access faster.
In row-based access like CSV, data retrieval takes time as the query has to navigate through each row and get the particular column values. With Parquet storage, all the required columns can be accessed at once.
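The difference between the two access patterns can be sketched in plain Python. This is only an in-memory illustration of the idea, not how Parquet lays out bytes on disk; the names `rows`, `columns`, and the two helper functions are made up for this example.

```python
COLUMN_NAMES = ("EmpId", "First name", "Last name", "Division")

# Row-oriented layout: one tuple per record, as in a CSV file.
rows = [
    (2012011, "Sam", "Butcher", "IT"),
    (2013031, "Mike", "Johnson", "Human Resource"),
    (2010052, "Bill", "Matthew", "Architect"),
    (2010079, "Jose", "Brian", "IT"),
    (2012120, "Adam", "James", "Solutions"),
]

# Column-oriented layout: one list per column.
columns = {
    name: [row[i] for row in rows] for i, name in enumerate(COLUMN_NAMES)
}


def divisions_from_rows(rows):
    """Visit every record and pick out the fourth field."""
    return [row[3] for row in rows]


def divisions_from_columns(columns):
    """Return the stored column directly; other columns are never touched."""
    return columns["Division"]


print(divisions_from_rows(rows))
print(divisions_from_columns(columns))
```

Both calls return the same values, but the columnar version never looks at employee IDs or names, which is why a query that needs only a few columns reads far less data.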
- Parquet is based on the columnar structure for data storage
- It is an optimized data format to store complex data in bulk in storage systems
- Parquet format includes various methods for data compression and encoding
- It significantly reduces data scan time and query time and takes less disk space compared to other storage formats like CSV
- Minimizes the number of I/O operations, lowering the cost of storage and query execution
- Includes metadata which makes it easier to find data
- Provides open-source support
Parquet data format
Before going into an example, let’s understand how data is stored in the Parquet format in more detail:
We can have multiple horizontal partitions, known as Row groups, in one file. Within each Row group, vertical partitioning is applied: the columns are split into several column chunks. The data is stored as pages inside the column chunks. Each page contains the encoded data values and metadata. As we mentioned before, the metadata for the entire file, including the metadata of each Row group, is also stored in the footer of the file.
As the data is split into column chunks, adding new data is also easy: the new values are encoded into a new chunk or a new file, and the metadata is then updated for the affected files and row groups. In that sense, Parquet is a flexible format.
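The layout described above can be modelled with a short sketch. It is a simplified in-memory model under stated assumptions: pages are left out, and `ColumnChunk`, `build_row_groups`, and `select_greater_equal` are hypothetical names invented here, not part of any Parquet library. It shows how min/max statistics kept per column chunk let a reader skip whole row groups.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """Values of one column within one row group, plus min/max statistics."""
    values: list
    min_value: object
    max_value: object


def build_row_groups(columns, rows_per_group):
    """Split the table horizontally, then keep each column as its own chunk."""
    n_rows = len(next(iter(columns.values())))
    groups = []
    for start in range(0, n_rows, rows_per_group):
        group = {}
        for name, values in columns.items():
            part = values[start:start + rows_per_group]
            group[name] = ColumnChunk(part, min(part), max(part))
        groups.append(group)
    return groups


def select_greater_equal(row_groups, column, threshold):
    """Return matching values and how many row groups were actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        chunk = group[column]
        if chunk.max_value < threshold:
            continue  # statistics prove nothing here can match, so skip it
        groups_read += 1
        matches.extend(v for v in chunk.values if v >= threshold)
    return matches, groups_read


columns = {
    "EmpId": [2010052, 2010079, 2012011, 2012120, 2013031],
    "Division": ["Architect", "IT", "IT", "Solutions", "Human Resource"],
}
row_groups = build_row_groups(columns, rows_per_group=2)
matches, groups_read = select_greater_equal(row_groups, "EmpId", 2012100)
print(matches, groups_read, len(row_groups))
```

In this run the first of the three row groups is skipped without reading its values, because its maximum EmpId is below the threshold. Skipping works best when the data is sorted or clustered on the filtered column, as it is in this sample.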
Parquet natively supports the compression of data using page compression and dictionary encoding techniques. Let’s see a simple example of dictionary compression:
Note that in the above example, we see the IT division 4 times. So, while storing the column, the format replaces each distinct value with a small, easy-to-store code (0, 1, 2, …) kept in a dictionary, along with the number of times the value is repeated consecutively: IT, IT is changed to 0,2 to save more space. Querying compressed data takes less time.
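The idea can be sketched in a few lines of Python. This is a simplified illustration of dictionary encoding followed by run-length encoding, not Parquet's actual on-disk encoding, and the `divisions` column below is a hypothetical one chosen so that IT appears four times.

```python
def dictionary_encode(values):
    """Replace each distinct value with a small integer code."""
    dictionary = {}
    codes = []
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        codes.append(dictionary[value])
    # Dict keys keep insertion order, so position i holds the value for code i.
    return list(dictionary), codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, count) pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1] = (code, runs[-1][1] + 1)
        else:
            runs.append((code, 1))
    return runs


def decode(dictionary, runs):
    """Rebuild the original column from the dictionary and the runs."""
    return [dictionary[code] for code, count in runs for _ in range(count)]


# Hypothetical Division column with IT repeated consecutively.
divisions = ["IT", "IT", "Human Resource", "Architect", "IT", "IT", "Solutions"]

dictionary, codes = dictionary_encode(divisions)
runs = run_length_encode(codes)

print(dictionary)
print(runs)
```

Here the first two entries, IT and IT, become the single pair `(0, 2)`, matching the "0,2" in the description above, and `decode(dictionary, runs)` returns the original column unchanged.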
Now that we have a fair idea of what the CSV and Parquet formats look like, it's time for some statistics to compare the two formats:
|CSV||Parquet|
|Row-based storage format.||A hybrid of Row-based and column-based storage formats.|
|It consumes a lot of space as no default compression option is available. For example, a 1 TB file will occupy the same 1 TB when stored on Amazon S3 or any other cloud.||Compresses data while storing, thus consuming less space. A 1 TB file stored in Parquet format will take up only 130 GB of space.|
|Query run time is slow because of the row-based search. For each column, every row of data has to be retrieved.||Query time is about 34 times faster because of the column-based storage and presence of metadata.|
|More data has to be scanned per query.||About 99% less data is scanned for the execution of the query, thus optimizing performance.|
|Most storage services charge based on the space used, so the CSV format means a high storage cost.||Less storage cost as data is stored in compressed, encoded format.|
|File schema has to be either inferred (leading to errors) or supplied (tedious).||File schema is stored in the metadata.|
|The format is suitable for simple data types.||Parquet is suitable even for complex types like nested schemas, arrays, dictionaries.|
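The storage figures in the comparison can be turned into a quick back-of-the-envelope calculation. The sketch below uses the 1 TB and 130 GB figures from the table and treats 1 TB as 1000 GB. The price per GB is a purely hypothetical placeholder, not any provider's actual rate, so substitute your own.

```python
def storage_reduction(original_gb, compressed_gb):
    """Fraction of space saved by the smaller representation."""
    return 1 - compressed_gb / original_gb


original_gb = 1000  # 1 TB of CSV, taken as 1000 GB
parquet_gb = 130    # size quoted in the comparison above

reduction = storage_reduction(original_gb, parquet_gb)

# Hypothetical price, for illustration only.
PRICE_PER_GB_MONTH = 0.02

csv_cost = original_gb * PRICE_PER_GB_MONTH
parquet_cost = parquet_gb * PRICE_PER_GB_MONTH

print(f"Space saved: {reduction:.0%}")
print(f"Monthly storage cost: CSV {csv_cost:.2f} vs Parquet {parquet_cost:.2f}")
```

Whatever the actual rate, the ratio stays the same: with these figures the Parquet copy costs 13% of what the CSV copy costs to store.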
We have seen through examples that Parquet is more efficient than CSV in terms of cost, flexibility, and performance. It is an effective mechanism for storing and retrieving data, especially when the entire world is moving towards cloud storage and space optimization. All major platforms like Azure, AWS, and BigQuery support Parquet format.