Regular expressions in PySpark
I was reading the book "Spark: The Definitive Guide" and, while working through a code example, I couldn't fully understand the logic. Below is the code from the book.
from pyspark.sql.functions import expr, locate

simpleColors = ["black", "white", "green", "blue", "red"]
def color_locator(column, color_string):
    # locate() returns the 1-based position of the substring (0 if absent);
    # casting to boolean turns that into a found / not-found flag
    return locate(color_string.upper(), column).cast("boolean").alias("is_" + color_string)
selectedColumns = [color_locator(df.Description, c) for c in simpleColors]
selectedColumns.append(expr("*"))
df.select(*selectedColumns).where(expr("is_white OR is_red")).select("Description").show(3, False)
I don't understand the line selectedColumns.append(expr("*")) in the code. What does it accomplish? The book says this is needed because every element of selectedColumns has to be a Column type, but that explanation went completely over my head. The next statement then uses df.select(*selectedColumns). Why do we need the * expression in the first place? Please help me resolve the confusion.
Let me try to break it down so that you can understand what is happening:
selectedColumns = [color_locator(df.Description, c) for c in simpleColors ]
In this line we iterate over the colors in simpleColors and build a list, selectedColumns, with one Column expression per color. At this point selectedColumns contains the columns "is_black", "is_white", "is_green", "is_blue", "is_red". Notice that it doesn't contain the Description column.
The next line,
selectedColumns.append(expr("*"))
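appends one more Column expression, expr("*"), which Spark expands to every column of the original dataframe (a shorthand for adding each column explicitly). Wrapping "*" in expr() keeps every element of the list a Column, which appears to be the point the book is making.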
At this point selectedColumns contains the columns "is_black", "is_white", "is_green", "is_blue", "is_red", "*".
df.select(*selectedColumns).where(expr("is_white OR is_red")).select("Description").show(3,False)
In this line, *selectedColumns unpacks the list so that each element is passed to select as a separate positional argument. You can read more about it here: https://www.geeksforgeeks.org/args-kwargs-python/
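To see the unpacking on its own, here is a minimal plain-Python sketch. The select function below is a hypothetical stand-in that only records the arguments it receives; it is not PySpark's DataFrame.select, and the strings stand in for Column objects.

```python
def select(*cols):
    # Hypothetical stand-in for DataFrame.select: it just returns the
    # positional arguments it was called with, as a tuple.
    return cols

# Strings used here in place of real Column objects (assumption for illustration)
selected_columns = ["is_black", "is_white", "*"]

unpacked = select(*selected_columns)   # same as select("is_black", "is_white", "*")
as_one_arg = select(selected_columns)  # a single argument: the list itself

print(len(unpacked))    # 3
print(len(as_one_arg))  # 1
```

With the star, the function receives three separate arguments; without it, it receives one list. As far as I know, PySpark's select also accepts a single list, but unpacking makes the intent explicit.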
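To summarize, we select the columns is_black, is_white, is_green, is_blue, is_red and * from the original dataframe (df), keep only the rows where is_white or is_red is true, and then show the Description column for the first 3 of those rows.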
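If it helps, the whole pipeline can be mimicked in plain Python. This is only a rough analogue under stated assumptions, not Spark code: str.find plus one imitates locate's convention (1-based position, 0 when absent), bool() imitates .cast("boolean"), and the sample descriptions and the color_flags helper are made up for illustration.

```python
simple_colors = ["black", "white", "green", "blue", "red"]

def color_flags(description):
    # str.find() returns -1 when the substring is absent, so adding 1 gives
    # 0 for "not found" and a positive position otherwise, like locate().
    # bool(...) then plays the role of .cast("boolean").
    return {
        "is_" + color: bool(description.find(color.upper()) + 1)
        for color in simple_colors
    }

# Hypothetical sample data standing in for df.Description
descriptions = ["WHITE METAL LANTERN", "RED WOOLLY HOTTIE", "GREEN TEA CUP"]

# Flag columns first, then the original column (the role expr("*") plays)
rows = [{**color_flags(d), "Description": d} for d in descriptions]

# where("is_white OR is_red") followed by select("Description")
matches = [r["Description"] for r in rows if r["is_white"] or r["is_red"]]
print(matches)  # ['WHITE METAL LANTERN', 'RED WOOLLY HOTTIE']
```

Without the original column in each row, the final step would have no Description left to select, which is why the "*" is appended.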