What are the pros and cons of parquet format compared to other formats?
I think the main difference I can describe relates to record oriented vs. column oriented formats. Record oriented formats are what we're all used to -- text files, delimited formats like CSV, TSV. AVRO is slightly cooler than those because it can change schema over time, e.g. adding or removing columns from a record. Other tricks of various formats (especially including compression) involve whether a format can be split -- that is, can you read a block of records from anywhere in the dataset and still know it's schema? But here's more detail on columnar formats like Parquet.
Parquet, and other columnar formats handle a common Hadoop situation very efficiently. It is common to have tables (datasets) having many more columns than you would expect in a well-designed relational database -- a hundred or two hundred columns is not unusual. This is so because we often use Hadoop as a place to denormalize data from relational formats -- yes, you get lots of repeated values and many tables all flattened into a single one. But it becomes much easier to query since all the joins are worked out. There are other advantages such as retaining state-in-time data. So anyway it's common to have a boatload of columns in a table.
Let's say there are 132 columns, and some of them are really long text fields, each different column one following the other and use up maybe 10K per record.
While querying these tables is easy with SQL standpoint, it's common that you'll want to get some range of records based on only a few of those hundred-plus columns. For example, you might want all of the records in February and March for customers with sales > $500.
To do this in a row format the query would need to scan every record of the dataset. Read the first row, parse the record into fields (columns) and get the date and sales columns, include it in your result if it satisfies the condition. Repeat. If you have 10 years (120 months) of history, you're reading every single record just to find 2 of those months. Of course this is a great opportunity to use a partition on year and month, but even so, you're reading and parsing 10K of each record/row for those two months just to find whether the customer's sales are > $500.
In a columnar format, each column (field) of a record is stored with others of its kind, spread all over many different blocks on the disk -- columns for year together, columns for month together, columns for customer employee handbook (or other long text), and all the others that make those records so huge all in their own separate place on the disk, and of course columns for sales together. Well heck, date and months are numbers, and so are sales -- they are just a few bytes. Wouldn't it be great if we only had to read a few bytes for each record to determine which records matched our query? Columnar storage to the rescue!
Even without partitions, scanning the small fields needed to satisfy our query is super-fast -- they are all in order by record, and all the same size, so the disk seeks over much less data checking for included records. No need to read through that employee handbook and other long text fields -- just ignore them. So, by grouping columns with each other, instead of rows, you can almost always scan less data. Win!
But wait, it gets better. If your query only needed to know those values and a few more (let's say 10 of the 132 columns) and didn't care about that employee handbook column, once it had picked the right records to return, it would now only have to go back to the 10 columns it needed to render the results, ignoring the other 122 of the 132 in our dataset. Again, we skip a lot of reading.
(Note: for this reason, columnar formats are a lousy choice when doing straight transformations, for example, if you're joining all of two tables into one big(ger) result set that you're saving as a new table, the sources are going to get scanned completely anyway, so there's not a lot of benefit in read performance, and because columnar formats need to remember more about the where stuff is, they use more memory than a similar row format).
One more benefit of columnar: data is spread around. To get a single record, you can have 132 workers each read (and write) data from/to 132 different places on 132 blocks of data. Yay for parallelization!
And now for the clincher: compression algorithms work much better when it can find repeating patterns. You could compress AABBBBBBCCCCCCCCCCCCCCCC
as 2A6B16C
but ABCABCBCBCBCCCCCCCCCCCCCC
wouldn't get as small (well, actually, in this case it would, but trust me :-) ). So once again, less reading. And writing too.
So we read a lot less data to answer common queries, it's potentially faster to read and write in parallel, and compression tends to work much better.
Columnar is great when your input side is large, and your output is a filtered subset: from big to little is great. Not as beneficial when the input and outputs are about the same.
But in our case, Impala took our old Hive queries that ran in 5, 10, 20 or 30 minutes, and finished most in a few seconds or a minute.
Hope this helps answer at least part of your question!
Avro is a row-based storage format for Hadoop.
Parquet is a column-based storage format for Hadoop.
If your use case typically scans or retrieves all of the fields in a row in each query, Avro is usually the best choice.
If your dataset has many columns, and your use case typically involves working with a subset of those columns rather than entire records, Parquet is optimized for that kind of work.
Source
Tom's answer is quite detailed and exhaustive but you may also be interested in this simple study about Parquet vs Avro done at Allstate Insurance, summarized here:
"Overall, Parquet showed either similar or better results on every test [than Avro]. The query-performance differences on the larger datasets in Parquet’s favor are partly due to the compression results; when querying the wide dataset, Spark had to read 3.5x less data for Parquet than Avro. Avro did not perform well when processing the entire dataset, as suspected."
Choosing the right file format is important to building performant data applications. The concepts outlined in this post carry over to Pandas, Dask, Spark, and Presto / AWS Athena.
Column pruning
Column pruning is a big performance improvement that's possible for column-based file formats (Parquet, ORC) and not possible for row-based file formats (CSV, Avro).
Suppose you have a dataset with 100 columns and want to read two of them into a DataFrame. Here's how you can perform this with Pandas if the data is stored in a Parquet file.
import pandas as pd
pd.read_parquet('some_file.parquet', columns = ['id', 'firstname'])
Parquet is a columnar file format, so Pandas can grab the columns relevant for the query and can skip the other columns. This is a massive performance improvement.
If the data is stored in a CSV file, you can read it like this:
import pandas as pd
pd.read_csv('some_file.csv', usecols = ['id', 'firstname'])
usecols
can't skip over entire columns because of the row nature of the CSV file format.
Spark doesn't require users to explicitly list the columns that'll be used in a query. Spark builds up an execution plan and will automatically leverage column pruning whenever possible. Of course, column pruning is only possible when the underlying file format is column oriented.
Popularity
Spark and Pandas have built-in readers writers for CSV, JSON, ORC, Parquet, and text files. They don't have built-in readers for Avro.
Avro is popular within the Hadoop ecosystem. Parquet has gained significant traction outside of the Hadoop ecosystem. For example, the Delta Lake project is being built on Parquet files.
Arrow is an important project that makes it easy to work with Parquet files with a variety of different languages (C, C++, Go, Java, JavaScript, MATLAB, Python, R, Ruby, Rust), but doesn't support Avro. Parquet files are easier to work with because they are supported by so many different projects.
Schema
Parquet stores the file schema in the file metadata. CSV files don't store file metadata, so readers need to either be supplied with the schema or the schema needs to be inferred. Supplying a schema is tedious and inferring a schema is error prone / expensive.
Avro also stores the data schema in the file itself. Having schema in the files is a huge advantage and is one of the reasons why a modern data project should not rely on JSON or CSV.
Column metadata
Parquet stores metadata statistics for each column and lets users add their own column metadata as well.
The min / max column value metadata allows for Parquet predicate pushdown filtering that's supported by the Dask & Spark cluster computing frameworks.
Here's how to fetch the column statistics with PyArrow.
import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('some_file.parquet')
print(parquet_file.metadata.row_group(0).column(1).statistics)
<pyarrow._parquet.Statistics object at 0x11ac17eb0>
has_min_max: True
min: 1
max: 9
null_count: 0
distinct_count: 0
num_values: 3
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
Complex column types
Parquet allows for complex column types like arrays, dictionaries, and nested schemas. There isn't a reliable method to store complex types in simple file formats like CSVs.
Compression
Columnar file formats store related types in rows, so they're easier to compress. This CSV file is relatively hard to compress.
first_name,age
ken,30
felicia,36
mia,2
This data is easier to compress when the related types are stored in the same row:
ken,felicia,mia
30,36,2
Parquet files are most commonly compressed with the Snappy compression algorithm. Snappy compressed files are splittable and quick to inflate. Big data systems want to reduce file size on disk, but also want to make it quick to inflate the flies and run analytical queries.
Mutable nature of file
Parquet files are immutable, as described here. CSV files are mutable.
Adding a row to a CSV file is easy. You can't easily add a row to a Parquet file.
Data lakes
In a big data environment, you'll be working with hundreds or thousands of Parquet files. Disk partitioning of the files, avoiding big files, and compacting small files is important. The optimal disk layout of data depends on your query patterns.