How do I detect if a Spark DataFrame has a column
When I create a DataFrame from a JSON file in Spark SQL, how can I tell whether a given column exists before calling .select?
Example JSON schema:
{
  "a": {
    "b": 1,
    "c": 2
  }
}
This is what I want to do:
val potential_columns = Seq("b", "c", "d")
val df = sqlContext.read.json(filename)
potential_columns.map(column => if (df.hasColumn(column)) df.select(s"a.$column"))
but I can't find a good function for hasColumn. The closest I've gotten is to test whether the column is in this somewhat awkward array:
scala> df.select("a.*").columns
res17: Array[String] = Array(b, c)
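Spelled out, that workaround looks something like the sketch below (assuming the df and potential_columns from above; it only covers fields nested one level under a):
// Sketch: collect the names nested under "a", then select only the
// candidates that actually exist.
val nested = df.select("a.*").columns.toSet
val selections = potential_columns.filter(nested).map(c => df.select(s"a.$c"))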
Just assume it exists and let it fail with Try. Plain and simple, and it supports arbitrary nesting:
import scala.util.Try
import org.apache.spark.sql.DataFrame
def hasColumn(df: DataFrame, path: String) = Try(df(path)).isSuccess
val df = sqlContext.read.json(sc.parallelize(
"""{"foo": [{"bar": {"foobar": 3}}]}""" :: Nil))
hasColumn(df, "foobar")
// Boolean = false
hasColumn(df, "foo")
// Boolean = true
hasColumn(df, "foo.bar")
// Boolean = true
hasColumn(df, "foo.bar.foobar")
// Boolean = true
hasColumn(df, "foo.bar.foobaz")
// Boolean = false
Or even simpler:
val columns = Seq(
"foobar", "foo", "foo.bar", "foo.bar.foobar", "foo.bar.foobaz")
columns.flatMap(c => Try(df(c)).toOption)
// Seq[org.apache.spark.sql.Column] = List(
// foo, foo.bar AS bar#12, foo.bar.foobar AS foobar#13)
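One way to put that result to use (a sketch, assuming the df and columns values from above) is to pass the resolved columns to a single select:
// Sketch: select every column that resolved, in one pass. Nested paths
// such as "foo.bar" come back aliased to their leaf name.
val existing = columns.flatMap(c => Try(df(c)).toOption)
df.select(existing: _*).show()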
Python equivalent:
from pyspark.sql.utils import AnalysisException
from pyspark.sql import Row
def has_column(df, col):
    try:
        df[col]
        return True
    except AnalysisException:
        return False
df = sc.parallelize([Row(foo=[Row(bar=Row(foobar=3))])]).toDF()
has_column(df, "foobar")
## False
has_column(df, "foo")
## True
has_column(df, "foo.bar")
## True
has_column(df, "foo.bar.foobar")
## True
has_column(df, "foo.bar.foobaz")
## False
Another option I normally use is
df.columns.contains("column-name-to-check")
which returns a Boolean.
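Note that columns only lists top-level fields, so this check will not match nested paths. A quick sketch against the nested df from the first answer:
// Sketch: columns sees only top-level names, not nested paths.
df.columns.contains("foo")     // true
df.columns.contains("foo.bar") // false, even though the field resolves via df("foo.bar")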
Actually, you don't even need to call select in order to use columns; you can just call it on the DataFrame itself:
// define test data
case class Test(a: Int, b: Int)
val testList = List(Test(1,2), Test(3,4))
val testDF = sqlContext.createDataFrame(testList)
// define the hasColumn function
def hasColumn(df: org.apache.spark.sql.DataFrame, colName: String) = df.columns.contains(colName)
// then you can just use it on the DF with a given column name
hasColumn(testDF, "a") // <-- true
hasColumn(testDF, "c") // <-- false
Alternatively, you can define an implicit class using the "pimp my library" pattern so that the hasColumn method is available on your DataFrames directly:
implicit class DataFrameImprovements(df: org.apache.spark.sql.DataFrame) {
def hasColumn(colName: String) = df.columns.contains(colName)
}
Then you can use it as:
testDF.hasColumn("a") // <-- true
testDF.hasColumn("c") // <-- false
Try is not optimal, as it will evaluate the expression inside Try before making the decision. For large data sets, use the following in Scala:
df.schema.fieldNames.contains("column_name")
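As a quick sketch against the testDF defined in an earlier answer, this stays at the top level of the schema and never builds a Column expression:
// Sketch: fieldNames only covers top-level schema fields.
testDF.schema.fieldNames.contains("a") // true
testDF.schema.fieldNames.contains("c") // false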