How to check if at least one element of a list is included in a text column?

Solution 1:

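You can do this with the built-in rlike function, which tests a column against a regular expression. Joining the keywords with | and wrapping them in (^|\s) and (\s|$) matches any keyword that appears as a whole, whitespace-delimited word: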

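from pyspark.sql import functions

# Alternation of all keywords, each required to be preceded by the start of
# the string or whitespace and followed by whitespace or the end of the string.
# Raw strings keep \s from being read as a Python escape sequence.
pattern = r'(^|\s)(' + '|'.join(test_keywords) + r')(\s|$)'

test_df = test_df.withColumn("text_contains_word",
                             functions.col('text').rlike(pattern))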

test_df.show()
+---+--------------------+------------------+
| id|                text|text_contains_word|
+---+--------------------+------------------+
|  1|i like stackoverflow|             false|
|  2|tomorrow the sun ...|              true|
+---+--------------------+------------------+
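To see how the pattern behaves without starting Spark, here is a minimal sketch using Python's re module. The keyword list and sample sentences are made up for illustration, and the re.escape step is an addition that the code above does not have; it only matters if a keyword contains regex metacharacters such as "." or "+". Spark's rlike uses Java regular expressions, which should treat these particular constructs the same way, but that is an assumption worth checking on your own data.

```python
import re

# Hypothetical keywords; substitute your own test_keywords list.
test_keywords = ["sun", "rain", "c++"]


def build_pattern(keywords):
    # Escape each keyword so metacharacters are matched literally,
    # then require whitespace or a string boundary on both sides.
    alternatives = "|".join(re.escape(k) for k in keywords)
    return r"(^|\s)(" + alternatives + r")(\s|$)"


pattern = build_pattern(test_keywords)


def contains_keyword(text):
    # True when at least one keyword occurs as a whitespace-delimited word.
    return re.search(pattern, text) is not None


print(contains_keyword("the sun is out"))        # True
print(contains_keyword("i like stackoverflow"))  # False
print(contains_keyword("sunny days ahead"))      # False: "sun" is only a prefix
print(contains_keyword("i write c++ daily"))     # True
```

Because the delimiters are whitespace only, a keyword followed directly by punctuation (for example "sun.") is not matched. If that matters, a word-boundary pattern such as \b may be a better fit for keywords made of word characters.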