Join PySpark DataFrames where two lists share a value
I have two dataframes of the form
df1 =
+------+---------+
|group1|  members|
+------+---------+
|     1|[a, b, c]|
|     2|[d, e, f]|
|     3|[g, h, i]|
+------+---------+
df2 =
+------+---------+
|group2|  members|
+------+---------+
|     4|[s, t, d]|
|     5|[u, v, w]|
|     6|[x, y, b]|
+------+---------+
I would like to join these dataframes on the condition that their members lists share at least one value. For example, group2 would map onto df1 as:
+------+---------+------+
|group1|  members|group2|
+------+---------+------+
|     1|[a, b, c]|     6|
|     2|[d, e, f]|     4|
|     3|[g, h, i]|      |
+------+---------+------+
Is there an efficient method for this? At the moment I am just looping through the rows of df2 and using f.array_intersect() to compare.
You can use a left join. For the join condition, use the size function to check that the intersection of the two members arrays is non-empty, i.e. that its size is greater than 0.
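from pyspark.sql import functions as F

# Rename df2's array column so it does not clash with df1.members.
df2 = df2.toDF('group2', 'members2')
df = df1.join(df2, F.size(F.array_intersect(df1.members, df2.members2)) > 0, 'left').drop('members2')
df.show(truncate=False)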