Regular expressions in Pyspark

I was reading the book "Spark: The Definitive Guide" and, while working through a code example, I couldn't fully understand the logic. Below is the code from the book.

from pyspark.sql.functions import expr, locate

# df is the DataFrame loaded earlier in the book; it has a Description column
simpleColors = ["black", "white", "green", "blue", "red"]
def color_locator(column, color_string):
    return locate(color_string.upper(), column).cast("boolean").alias("is_" + color_string)
selectedColumns = [color_locator(df.Description, c) for c in simpleColors]
selectedColumns.append(expr("*"))
df.select(*selectedColumns).where(expr("is_white OR is_red")).select("Description").show(3, False)

I don't understand the line selectedColumns.append(expr("*")) in the code. What does it accomplish? The book says we need to do this to make sure selectedColumns is a Column type. That went completely over my head. And in the next statement we use df.select(*selectedColumns). Why do we need the * there in the first place? Please help me resolve the confusion.

Let me try to break it down so that you can understand what is happening:

selectedColumns = [color_locator(df.Description, c) for c in simpleColors ]

In this line we iterate over the colors in simpleColors and build a list called selectedColumns. At this point, selectedColumns contains five Column expressions aliased "is_black", "is_white", "is_green", "is_blue" and "is_red". Notice that the list does not contain the Description column.
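Each of those expressions relies on locate returning the 1-based position of the substring, or 0 when it is not found, so casting the result to boolean gives false for "not found" and true otherwise. Here is a minimal plain-Python sketch of that behavior. Note that locate_like is a hypothetical helper written for illustration (it is not a PySpark function), and the description string is just a sample value I made up:

```python
def locate_like(substr, text):
    """Mimic locate: 1-based position of substr in text, or 0 if it is absent."""
    # str.find returns -1 when the substring is absent, so adding 1 yields 0.
    return text.find(substr) + 1

simple_colors = ["black", "white", "green", "blue", "red"]
description = "WHITE HANGING HEART T-LIGHT HOLDER"  # sample value, for illustration only

# One boolean flag per color, named the same way as the aliases in the book's code.
flags = {"is_" + c: bool(locate_like(c.upper(), description)) for c in simple_colors}
print(flags)
```

Only the is_white flag is true for this sample, which is the kind of row the later where(expr("is_white OR is_red")) keeps.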

The next line,

selectedColumns.append(expr("*"))

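appends expr("*") to selectedColumns. This is a single Column expression that stands for every column in the original dataframe, so it is shorthand for adding each column explicitly. Because expr returns a Column, every element of the list is still a Column, which is what the book's remark is about.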

At this point selectedColumns contains the columns "is_black", "is_white", "is_green", "is_blue", "is_red" and "*".
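The "*" entry is one element in the list, but it is expanded into all of the dataframe's columns when the select runs. A rough plain-Python sketch of that idea, assuming a hypothetical expand_star helper and a made-up three-column table (only Description is taken from the code above); this is not how Spark implements it:

```python
def expand_star(selected, table_columns):
    """Hypothetical helper: replace each "*" entry with all of the table's columns."""
    expanded = []
    for name in selected:
        if name == "*":
            expanded.extend(table_columns)  # the star stands for every existing column
        else:
            expanded.append(name)
    return expanded

table_columns = ["InvoiceNo", "Description", "Quantity"]  # assumed column names
selected = ["is_black", "is_white", "is_green", "is_blue", "is_red", "*"]

result = expand_star(selected, table_columns)
print(result)
```

This is why the later select("Description") works: without the star, the intermediate result would only have the is_ columns.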

df.select(*selectedColumns).where(expr("is_white OR is_red")).select("Description").show(3,False)

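In this line, *selectedColumns unpacks the list so that each element is passed to select as a separate positional argument. This * is plain Python syntax and is unrelated to the "*" inside expr("*"). You can read more about it here: https://www.geeksforgeeks.org/args-kwargs-python/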
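A small plain-Python sketch of what the unpacking does. The select function here is a hypothetical stand-in that just returns its positional arguments; it is not the real DataFrame.select:

```python
def select(*cols):
    """Hypothetical stand-in for a method that takes a variable number of arguments."""
    return cols

selected_columns = ["is_black", "is_white", "*"]

unpacked = select(*selected_columns)   # same as select("is_black", "is_white", "*")
as_one_arg = select(selected_columns)  # one argument that happens to be a list

print(len(unpacked))
print(len(as_one_arg))
```

With the star, the function receives three separate arguments; without it, it receives a single list. As far as I know, PySpark's select also accepts a single list of columns, so the unpacking in the book's code is mostly a matter of style.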

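To summarize, we select the columns is_black, is_white, is_green, is_blue, is_red and * (every original column) from the original dataframe (df), keep only the rows where is_white or is_red is true, and then show the Description of the first 3 matching rows.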
