Filter spark DataFrame on string contains

Solution 1:

You can use contains (this works with an arbitrary sequence):

df.filter($"foo".contains("bar"))

like (SQL like with SQL simple regular expression whith _ matching an arbitrary character and % matching an arbitrary sequence):

df.filter($"foo".like("bar"))

or rlike (like with Java regular expressions):

df.filter($"foo".rlike("bar"))

depending on your requirements. LIKE and RLIKE should work with SQL expressions as well.

In pyspark,SparkSql syntax:

where column_n like 'xyz%'

might not work.

Use:

where column_n RLIKE '^xyz'

This works perfectly fine.