Schema comparison of two DataFrames in Scala
Based on @Derek Kaknes's answer, here's the solution I came up with for comparing schemas. It considers only column name, datatype & nullability and ignores metadata:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, StructField}

def getCleanedSchema(df: DataFrame): Map[String, (DataType, Boolean)] = {
  df.schema.map { (structField: StructField) =>
    structField.name.toLowerCase -> (structField.dataType, structField.nullable)
  }.toMap
}

// Compare relevant information
def getSchemaDifference(schema1: Map[String, (DataType, Boolean)],
                        schema2: Map[String, (DataType, Boolean)]
                       ): Map[String, (Option[(DataType, Boolean)], Option[(DataType, Boolean)])] = {
  (schema1.keys ++ schema2.keys).
    map(_.toLowerCase).
    toList.distinct.
    flatMap { (columnName: String) =>
      val schema1FieldOpt: Option[(DataType, Boolean)] = schema1.get(columnName)
      val schema2FieldOpt: Option[(DataType, Boolean)] = schema2.get(columnName)
      if (schema1FieldOpt == schema2FieldOpt) None
      else Some(columnName -> (schema1FieldOpt, schema2FieldOpt))
    }.toMap
}

- The getCleanedSchema method extracts the information of interest (column datatype & nullability) and returns a map of lowercased column name to a (datatype, nullability) tuple
- The getSchemaDifference method returns a map containing only those columns that differ between the two schemas. If a column is absent from one of the two schemas, its corresponding entry is None
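
To illustrate what the difference map looks like, here is a minimal sketch that runs the same logic on two hand-built schemas. The helper names cleanedSchemaOf and schemaDifference, and the columns themselves, are made up for this example; cleanedSchemaOf takes a StructType instead of a DataFrame only so that no SparkSession is needed.

```scala
import org.apache.spark.sql.types._

// Hypothetical variant of getCleanedSchema that works on a StructType directly.
def cleanedSchemaOf(schema: StructType): Map[String, (DataType, Boolean)] =
  schema.fields.map(f => (f.name.toLowerCase, (f.dataType, f.nullable))).toMap

// Compact equivalent of getSchemaDifference: keep only columns whose
// (datatype, nullability) entry differs, or that exist on one side only.
def schemaDifference(
    s1: Map[String, (DataType, Boolean)],
    s2: Map[String, (DataType, Boolean)]
): Map[String, (Option[(DataType, Boolean)], Option[(DataType, Boolean)])] =
  (s1.keySet ++ s2.keySet).toList.flatMap { name =>
    val (a, b) = (s1.get(name), s2.get(name))
    if (a == b) None else Some((name, (a, b)))
  }.toMap

// Made-up schemas for illustration.
val left = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("amount", IntegerType, nullable = true)
))
val right = StructType(Seq(
  StructField("ID", LongType, nullable = false),      // same column, different case
  StructField("amount", DoubleType, nullable = true), // datatype changed
  StructField("created", DateType, nullable = true)   // present only on the right
))

val diff = schemaDifference(cleanedSchemaOf(left), cleanedSchemaOf(right))
diff.foreach { case (column, (l, r)) => println(s"$column: left=$l, right=$r") }
```

With these inputs, id is not reported because the names are lowercased before comparison, while name, amount and created each appear with the left and right values side by side (None marking the side where the column is missing).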
I've just had the exact same problem. When you read data from Hive, the schema's StructField components will sometimes contain Hive metadata in their metadata field. You can't see it when printing the schemas because this field is not part of the toString definition.
Here is the solution I've decided to use: I take a copy of the schema with an empty Metadata on each field before comparing it:
import org.apache.spark.sql.types.Metadata
schema.map(_.copy(metadata = Metadata.empty))
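
As a rough sketch of why this helps, the snippet below builds two schemas that differ only in field metadata and compares them before and after clearing it. The helper withoutMetadata and the "comment" entry are made up for illustration; the metadata a real Hive read attaches may look different.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: rebuild the schema with each top-level field's metadata cleared.
// schema.map(...) alone yields a Seq[StructField], so it is wrapped back into a StructType.
def withoutMetadata(schema: StructType): StructType =
  StructType(schema.fields.map(_.copy(metadata = Metadata.empty)))

// Made-up metadata standing in for whatever a Hive read might attach.
val hiveLikeMeta = new MetadataBuilder().putString("comment", "loaded from Hive").build()

val plain = StructType(Seq(StructField("id", LongType, nullable = false)))
val annotated = StructType(Seq(StructField("id", LongType, nullable = false, hiveLikeMeta)))

// Metadata takes part in StructField equality, so these two schemas differ as-is.
println(plain == annotated)
// After clearing metadata, only name, datatype and nullability are compared.
println(withoutMetadata(plain) == withoutMetadata(annotated))
```

Note that this only clears metadata on top-level fields; fields nested inside a struct column keep theirs, so nested schemas would need a recursive version.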