Schema comparison of two dataframes in scala

Based on @Derek Kaknes's answer, here's the solution I came up with for comparing schemas, being concerned only about column name, datatype & nullability and indifferent to metadata

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, StructField}

def getCleanedSchema(df: DataFrame): Map[String, (DataType, Boolean)] = {
    df.schema.map { (structField: StructField) =>
      structField.name.toLowerCase -> (structField.dataType, structField.nullable)
    }.toMap
  }

// Compare relevant information
def getSchemaDifference(schema1: Map[String, (DataType, Boolean)],
                        schema2: Map[String, (DataType, Boolean)]
                       ): Map[String, (Option[(DataType, Boolean)], Option[(DataType, Boolean)])] = {
  (schema1.keys ++ schema2.keys).
    map(_.toLowerCase).
    toList.distinct.
    flatMap { (columnName: String) =>
      val schema1FieldOpt: Option[(DataType, Boolean)] = schema1.get(columnName)
      val schema2FieldOpt: Option[(DataType, Boolean)] = schema2.get(columnName)

      if (schema1FieldOpt == schema2FieldOpt) None
      else Some(columnName -> (schema1FieldOpt, schema2FieldOpt))
    }.toMap
}
  • getCleanedSchema method extracts information of interest - column datatype & nullability and returns a map of column name to tuple

  • getSchemaDifference method returns a map containing only those columns that differ in the two schemas. If a column is absent in one of the two schemas, then it's corresponding properties would be None


I've just had the exact same problem. When you read data from Hive the schema's StructField component will sometimes contain Hive metadata in the field metadata. You can't see it when printing the schemas because this field is not part of the toString definition.

Here is the solution I've decided to use, I just get a copy of the schema with an empty Metadata before comparing it :

schema.map(_.copy(metadata = Metadata.empty))