Removing duplicate columns after a DF join in Spark

Solution 1:

If the join columns have the same names in both data frames and you only need an equi-join, you can specify the join columns as a list, in which case the result keeps only one copy of each join column:
df1.show()
+---+----+
| id|val1|
+---+----+
|  1|   2|
|  2|   3|
|  4|   4|
|  5|   5|
+---+----+

df2.show()
+---+----+
| id|val2|
+---+----+
|  1|   2|
|  1|   3|
|  2|   4|
|  3|   5|
+---+----+

df1.join(df2, ['id']).show()
+---+----+----+
| id|val1|val2|
+---+----+----+
|  1|   2|   2|
|  1|   2|   3|
|  2|   3|   4|
+---+----+----+

Otherwise, you need to give each of the joined data frames an alias and refer to the duplicated columns through that alias afterwards:

df1.alias("a").join(
    df2.alias("b"), df1['id'] == df2['id']
).select("a.id", "a.val1", "b.val2").show()
+---+----+----+
| id|val1|val2|
+---+----+----+
|  1|   2|   2|
|  1|   2|   3|
|  2|   3|   4|
+---+----+----+

Solution 2:

With df.join(other, on, how), when on is a column name string or a list of column name strings, the returned DataFrame contains each join column only once. When on is a join expression, the join columns of both sides are kept, so the result has duplicate columns. In that case you can use .drop(df.a) to drop the duplicate column. Example:

# other.bb is an assumed column name for illustration
cond = [df.a == other.a, df.b == other.bb, df.c == other.ccc]
# joining on cond keeps both df.a and other.a, so drop one of them
result = df.join(other, cond, 'inner').drop(df.a)