What is the advantage of partitioning a Delta/Spark table by year/month/day, rather than just date?

In many data lakes I see that data is partitioned by year, then month, then day, for example:

year=2019 / month=05 / day=15

What is the advantage of doing this vs. simply partitioning by date? e.g.:

date=20190515

The only advantage I can think of is if, for example, analysts want to query all data for a particular month or year. If partitioning only on date, they would have to write a query with a calculation on the partition key, such as the pseudocode below:

SELECT * FROM myTable WHERE LEFT(date, 4) = '2019'

Would Spark still be able to do partition pruning for queries like the above? Are there any other advantages to the more nested partition structure that I haven't considered?
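
For reference, this is roughly how I have been checking whether pruning happens (the path, table name, and layout are just an example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Table partitioned by a single string column, e.g. date=20190515
df = spark.read.format("delta").load("/mnt/lake/myTable")

# Filter that applies a function to the partition column
df.filter("left(date, 4) = '2019'").explain()
# Then check the PartitionFilters entry in the printed physical plan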

Thank you


I would argue it's a disadvantage, because splitting the date into separate parts makes date filtering much harder. For example, say you want to query the last 10 days of data, which may cross month boundaries. With a single date column you can just run a simple query like

...where date >= current_date() - interval 10 days

and Spark will figure out the correct partitions for you. Spark can also handle other date functions, such as year(date) = 2019 or month(date) = 2, and again it will do the partition pruning for you.
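
As a quick illustration, here is a minimal PySpark sketch (the path /mnt/lake/events and the DateType column name date are just assumptions) showing both kinds of filters; the PartitionFilters entry in each printed physical plan confirms which partitions are actually read:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Delta table partitioned by a single DateType column called "date"
df = spark.read.format("delta").load("/mnt/lake/events")

# Rolling 10-day window that may cross month boundaries
recent = df.filter("date >= current_date() - interval 10 days")

# Date functions on the partition column are also pruned
feb = df.filter("year(date) = 2019 AND month(date) = 2")

# PartitionFilters in the physical plan shows the pruning
recent.explain()
feb.explain()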

I always encourage using a single date column for partitioning. Let Spark do the work.
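
A minimal write sketch, assuming a DataFrame that already has a DateType column named date (the paths here are just placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Source data with a DateType "date" column (placeholder path)
events = spark.read.format("parquet").load("/mnt/landing/events")

# One partition column, one directory level: date=2019-05-15/...
(events.write
    .format("delta")
    .partitionBy("date")
    .mode("append")
    .save("/mnt/lake/events"))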

Also, an important thing to keep in mind is that the date format should be yyyy-MM-dd, so the partition column parses as a proper date type and its values sort chronologically.
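
If the raw data arrives as a compact string like 20190515, a to_date call converts it into a properly typed column (the column names here are just examples):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.createDataFrame([("20190515",)], ["date_str"])

# Parse the compact string into a DateType column, rendered as 2019-05-15
clean = raw.withColumn("date", F.to_date("date_str", "yyyyMMdd"))
clean.show()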