Overwrite only some partitions in a partitioned Spark Dataset

How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, when recomputing last week's daily jobs, only last week's data should be overwritten.

Spark's default behaviour is to overwrite the whole table, even if only some partitions are going to be written.

Solution 1:

Since Spark 2.3.0, this is available as an option when overwriting a table. To use it, set the new spark.sql.sources.partitionOverwriteMode setting to dynamic; the dataset must be partitioned, and the write mode must be overwrite. Example in Scala:

spark.conf.set(
  "spark.sql.sources.partitionOverwriteMode", "dynamic"
)
data.write.mode("overwrite").insertInto("partitioned_table")
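To make the difference between the two modes concrete, here is a toy model in plain Python. It is not Spark code: the table is just a dict from partition value to rows, and the function name and sample dates are made up for illustration. It only mirrors the documented semantics (static replaces everything when no partition is specified, dynamic replaces only the partitions that receive data).

```python
def overwrite(table, new_rows, mode="static"):
    """Toy model of an overwrite on a partitioned table.

    `table` maps a partition value to its list of rows;
    `new_rows` is a list of (partition value, row) pairs.
    """
    # Group the incoming rows by partition value.
    incoming = {}
    for partition, row in new_rows:
        incoming.setdefault(partition, []).append(row)

    if mode == "static":
        # All existing partitions are dropped; only the new data remains.
        return incoming
    if mode == "dynamic":
        # Only partitions that receive new rows are replaced.
        result = dict(table)
        result.update(incoming)
        return result
    raise ValueError("unknown mode: {}".format(mode))


table = {"2019-01-01": ["a"], "2019-01-02": ["b"], "2019-01-03": ["c"]}
new_rows = [("2019-01-03", "c2")]

static_result = overwrite(table, new_rows, "static")
dynamic_result = overwrite(table, new_rows, "dynamic")
```

In this model the static result keeps only the 2019-01-03 partition, while the dynamic result keeps the two untouched partitions and replaces 2019-01-03.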

I recommend repartitioning by your partition column before writing, so that you won't end up with 400 files per folder.
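A minimal PySpark sketch of that recommendation, combining the repartition with the dynamic overwrite. The function name and its parameters are hypothetical; it assumes `spark` is an active SparkSession, `data` is a DataFrame whose columns line up with the target table, and `partition_column` is the column the table is partitioned by.

```python
def overwrite_touched_partitions(spark, data, table_name, partition_column):
    """Overwrite only the partitions of `table_name` present in `data`.

    Hypothetical helper: `spark` is assumed to be a SparkSession and
    `data` a DataFrame matching the layout of the target table.
    """
    # Replace only the partitions that receive data (Spark >= 2.3.0).
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    # Repartitioning by the partition column sends all rows of one
    # partition value to the same task, which keeps the number of
    # files written per partition folder small.
    (
        data.repartition(partition_column)
        .write
        .insertInto(table_name, overwrite=True)
    )
```

It would be called as overwrite_touched_partitions(spark, data, "partitioned_table", "date"), where "date" stands in for whatever your partition column is called.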

Before Spark 2.3.0, the best solution would be to launch SQL statements to delete those partitions and then write them with mode append.
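A rough sketch of that older approach in Python. The helper name, the column name "date" and the sample values are made up for illustration, and it assumes a string-typed partition column whose values need no escaping. It only builds the SQL text; running it is left to spark.sql.

```python
def drop_partition_statements(table_name, partition_column, values):
    """Build one ALTER TABLE ... DROP PARTITION statement per value.

    Hypothetical helper: values are quoted as strings and not escaped,
    so it is only suitable for trusted, simple partition values.
    """
    return [
        "ALTER TABLE {} DROP IF EXISTS PARTITION ({}='{}')".format(
            table_name, partition_column, value
        )
        for value in values
    ]


statements = drop_partition_statements(
    "partitioned_table", "date", ["2019-01-01", "2019-01-02"]
)

# With a live session, the statements would then be run and the
# recomputed data appended, for example:
#   for statement in statements:
#       spark.sql(statement)
#   data.write.insertInto("partitioned_table")  # appends by default in PySpark
```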

Solution 2:

Just FYI, for PySpark users: make sure to set overwrite=True in insertInto, otherwise the mode is changed to append.

From the source code:

def insertInto(self, tableName, overwrite=False):
    self._jwrite.mode(
        "overwrite" if overwrite else "append"
    ).insertInto(tableName)

This is how to use it:

spark.conf.set("spark.sql.sources.partitionOverwriteMode","DYNAMIC")
data.write.insertInto("partitioned_table", overwrite=True)

The SQL version also works fine:

INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] select_statement

See the documentation for details.
