Is it better to have one large parquet file or lots of smaller parquet files?

Solution 1:

Aim for around 1 GB per file (Spark partition) (1).

Ideally, you would use Snappy compression (the default), since Snappy-compressed Parquet files are splittable (2).

Using Snappy instead of gzip will significantly increase the file size, so if storage space is a concern, that needs to be taken into account.

.option("compression", "gzip") is the option to override the default snappy compression.

If you need to resize/repartition your Dataset/DataFrame/RDD, call .coalesce(&lt;num_partitions&gt;) or, worst case, .repartition(&lt;num_partitions&gt;). Warning: repartition in particular, but also coalesce, can cause data to be moved around the cluster, so use both with some caution.
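A rough sketch of both calls before a write, reusing the hypothetical df from above (the partition count of 200 is only an illustrative number):

```scala
// coalesce merges existing partitions without a full shuffle,
// so it is usually the cheaper way to reduce the number of output files.
df.coalesce(200).write.parquet("/data/parquet/events_coalesced")

// repartition performs a full shuffle but produces evenly sized partitions,
// which can be worth it when the existing partitions are badly skewed.
df.repartition(200).write.parquet("/data/parquet/events_repartitioned")
```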

Also, Parquet files (and, for that matter, files in general) should be larger than the HDFS block size (128 MB by default).
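One way to pick the partition count is to work backwards from the ~1 GB-per-file target; the input size below is a made-up figure, and in practice you would estimate it from your source data:

```scala
// Hypothetical numbers: ~500 GB of input, ~1 GB target per output file.
val totalSizeBytes  = 500L * 1024 * 1024 * 1024
val targetFileBytes = 1L * 1024 * 1024 * 1024

// One Spark partition per output file, rounded up.
val numPartitions = math.ceil(totalSizeBytes.toDouble / targetFileBytes).toInt

df.repartition(numPartitions).write.parquet("/data/parquet/events_sized")
```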

(1) https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html
(2) http://boristyukin.com/is-snappy-compressed-parquet-file-splittable/

Solution 2:

Notice that Parquet files are internally split into row groups.

[Parquet file layout diagram]

https://parquet.apache.org/documentation/latest/

So by making Parquet files larger, the row groups can stay the same size, provided your baseline Parquet files were not small/tiny. There is no big direct penalty on processing; on the contrary, if your Parquet files were small/tiny, readers now get more opportunities to take advantage of larger, more optimal row groups, since row groups cannot span multiple Parquet files.

Larger Parquet files also do not limit the parallelism of readers, since each Parquet file can be broken up logically into multiple splits (each consisting of one or more row groups).
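If you want to influence the row group size when writing from Spark, the underlying parquet-mr writer reads it from parquet.block.size in the Hadoop configuration; the value below is simply the Parquet default, shown as an example:

```scala
// Target row group size in bytes used by the Parquet writer (128 MB is the library default).
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)

df.write.parquet("/data/parquet/events_rowgroups")
```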

The only downside of larger Parquet files is that they take more memory to create, so watch out in case you need to bump up the Spark executors' memory.

Row groups are the way Parquet files get horizontal partitioning: each row group holds a subset of the rows. Within a row group there is one column chunk per column, which is what provides vertical partitioning for the data in Parquet.
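As a rough way to see that structure, parquet-mr's ParquetFileReader can list the row groups and column chunks of a file. This is only a sketch with a hypothetical path (readFooter is deprecated in newer Parquet versions but still works):

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

// Hypothetical part file; point this at one of your own outputs.
val footer = ParquetFileReader.readFooter(
  new Configuration(), new Path("/data/parquet/events_snappy/part-00000.snappy.parquet"))

footer.getBlocks.asScala.zipWithIndex.foreach { case (rowGroup, i) =>
  // Each block is a row group: a horizontal slice of the rows.
  println(s"row group $i: ${rowGroup.getRowCount} rows, ${rowGroup.getTotalByteSize} bytes")
  rowGroup.getColumns.asScala.foreach { chunk =>
    // Each column chunk holds one column's values for this row group.
    println(s"  column ${chunk.getPath}: ${chunk.getTotalSize} bytes on disk")
  }
}
```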