Wrong sequence of months in PySpark sequence interval month

I am trying to create an array of dates containing every month from a minimum date to a maximum date.

Example:

min_date = "2021-05-31"
max_date = "2021-11-30"

.withColumn('array_date', F.expr('sequence(to_date(min_date), to_date(max_date), interval 1 month)'))

But it gives me the following output:

['2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31', '2021-09-30', '2021-10-31']

Why doesn't the upper limit, 2021-11-30, appear in the result? The documentation says that both endpoints are included.

My desired output is:

['2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31', '2021-09-30', '2021-10-31', '2021-11-30']

Thank you!


Solution 1:

I think this is related to the time zone. I can reproduce the same behavior in my time zone (Europe/Paris), but when I set the session time zone to UTC it gives the expected result:

from pyspark.sql import functions as F

# Set the session time zone to UTC before computing the sequence
spark.conf.set("spark.sql.session.timeZone", "UTC")

df = spark.createDataFrame([("2021-05-31", "2021-11-30")], ["min_date", "max_date"])

df.withColumn(
    "array_date",
    F.expr("sequence(to_date(min_date), to_date(max_date), interval 1 month)")
).show(truncate=False)

#+----------+----------+------------------------------------------------------------------------------------+
#|min_date  |max_date  |array_date                                                                          |
#+----------+----------+------------------------------------------------------------------------------------+
#|2021-05-31|2021-11-30|[2021-05-31, 2021-06-30, 2021-07-31, 2021-08-31, 2021-09-30, 2021-10-31, 2021-11-30]|
#+----------+----------+------------------------------------------------------------------------------------+
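If you don't want to leave the session time zone set to UTC for the rest of your job, you can save the current value and restore it afterwards. A minimal sketch (the variable names are just for illustration):

# Remember the current session time zone, switch to UTC for this
# computation, then restore the previous value.
previous_tz = spark.conf.get("spark.sql.session.timeZone")
spark.conf.set("spark.sql.session.timeZone", "UTC")
try:
    result = df.withColumn(
        "array_date",
        F.expr("sequence(to_date(min_date), to_date(max_date), interval 1 month)")
    ).collect()
finally:
    spark.conf.set("spark.sql.session.timeZone", previous_tz)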

Alternatively, you can use TimestampType instead of DateType for the start and end parameters of the sequence:

df.withColumn(
    "array_date",
    # Build the sequence on timestamps, then cast the array back to dates
    F.expr("sequence(to_timestamp(min_date), to_timestamp(max_date), interval 1 month)").cast("array<date>")
).show(truncate=False)
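If you prefer the DataFrame API to a SQL expression string, the same idea can be written with F.sequence, which can also take an interval expression as its step (in recent Spark versions). A minimal sketch of that variant:

# Equivalent DataFrame-API form: build the sequence on timestamps and
# cast the resulting array back to dates.
df.withColumn(
    "array_date",
    F.sequence(
        F.to_timestamp("min_date"),
        F.to_timestamp("max_date"),
        F.expr("interval 1 month")
    ).cast("array<date>")
).show(truncate=False)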