What is the advantage of partitioning a Delta/Spark table by year/month/day, rather than just date?

In many data lakes I see that data is partitioned by year, then month, then day, for example:

year=2019 / month=05 / day=15

What is the advantage of doing this vs. simply partitioning by date? e.g.:

date=20190515

The only advantage I can think of is if, for example, analysts want to query all data for a particular month or year. If partitioning only on date, they would have to write a query with a calculation on the partition key, such as the pseudocode below:

SELECT * FROM myTable WHERE LEFT(date, 4) = '2019'

Would Spark still be able to do partition pruning for queries like the above? Are there any other advantages to the more nested partition structure that I haven't considered?
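
For reference, this is roughly how I have been checking whether pruning happens (the path, table name, and layout are just an example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Table partitioned by a single string column, e.g. date=20190515
df = spark.read.format("delta").load("/mnt/lake/myTable")

# Filter that applies a function to the partition column
df.filter("left(date, 4) = '2019'").explain()
# Then check the PartitionFilters entry in the printed physical plan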

Thank you


I would argue it's a disadvantage, because splitting the date into separate parts makes date filtering much harder. For example, say you want to query the last 10 days of data, which may cross month boundaries. With a single date column you can just run a simple query like

...where date >= current_date() - interval 10 days

and Spark will figure out the correct partitions for you. Spark can also handle other date functions, such as year(date) = 2019 or month(date) = 2, and again it will do the partition pruning for you.
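
As a quick illustration, here is a minimal PySpark sketch (the path /mnt/lake/events and the DateType column name date are just assumptions) showing both kinds of filters; the PartitionFilters entry in each printed physical plan confirms which partitions are actually read:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Delta table partitioned by a single DateType column called "date"
df = spark.read.format("delta").load("/mnt/lake/events")

# Rolling 10-day window that may cross month boundaries
recent = df.filter("date >= current_date() - interval 10 days")

# Date functions on the partition column are also pruned
feb = df.filter("year(date) = 2019 AND month(date) = 2")

# PartitionFilters in the physical plan shows the pruning
recent.explain()
feb.explain()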

I always encourage using a single date column for partitioning. Let Spark do the work.
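
A minimal write sketch, assuming a DataFrame that already has a DateType column named date (the paths here are just placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Source data with a DateType "date" column (placeholder path)
events = spark.read.format("parquet").load("/mnt/landing/events")

# One partition column, one directory level: date=2019-05-15/...
(events.write
    .format("delta")
    .partitionBy("date")
    .mode("append")
    .save("/mnt/lake/events"))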

Also, an important thing to keep in mind is that the date format should be yyyy-MM-dd, so the partition column parses as a proper date type and its values sort chronologically.
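
If the raw data arrives as a compact string like 20190515, a to_date call converts it into a properly typed column (the column names here are just examples):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.createDataFrame([("20190515",)], ["date_str"])

# Parse the compact string into a DateType column, rendered as 2019-05-15
clean = raw.withColumn("date", F.to_date("date_str", "yyyyMMdd"))
clean.show()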