How to convert Row of a Scala DataFrame into case class most efficiently?
Solution 1:
DataFrame is simply a type alias of Dataset[Row]. These operations are also referred to as "untyped transformations", in contrast to the "typed transformations" that come with strongly typed Scala/Java Datasets.
The conversion from Dataset[Row] to Dataset[Person] is very simple in Spark:
val DFtoProcess = spark.sql("SELECT * FROM peoples WHERE name='test'")
At this point, Spark converts your data into DataFrame = Dataset[Row], a collection of generic Row objects, since it does not know the exact type.
// Create an Encoder for a Java bean class (in my example Person is a Java class)
// For a Scala case class you can use Encoders.product[Person] or simply import spark.implicits._
import org.apache.spark.sql.Encoders
val personEncoder = Encoders.bean(classOf[Person])
val DStoProcess = DFtoProcess.as[Person](personEncoder)
Now Spark converts the Dataset[Row] into a Dataset[Person] of type-specific Scala/Java JVM objects, as dictated by the class Person.
Please refer to the link below, provided by Databricks, for further details:
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
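For completeness, here is a minimal sketch of the same conversion when Person is a Scala case class rather than a Java bean; the table name peoples and the column names are assumptions carried over from the example above.
import org.apache.spark.sql.{Dataset, SparkSession}
// Assumed Scala case class matching the columns of the peoples table
case class Person(name: String, age: Int)
val spark = SparkSession.builder().appName("RowToCaseClass").getOrCreate()
import spark.implicits._   // brings the implicit Encoder for case classes into scope
val dfToProcess = spark.sql("SELECT * FROM peoples WHERE name='test'")
// No explicit Encoder needed: the implicit product encoder is picked up automatically
val dsToProcess: Dataset[Person] = dfToProcess.as[Person]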
Solution 2:
As far as I know you cannot cast a Row to a case class, but I sometimes choose to access the row fields directly, like
map(row => myCaseClass(row.getLong(0), row.getString(1), row.getDouble(2)))
I find this to be easier, especially if the case class constructor only needs some of the fields from the row.
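A self-contained sketch of that approach (the case class, column names, and types here are assumptions for illustration):
import org.apache.spark.sql.SparkSession
// Hypothetical case class that only needs three of the row's columns
case class Summary(id: Long, name: String, score: Double)
val spark = SparkSession.builder().appName("RowFieldAccess").getOrCreate()
import spark.implicits._   // provides the Encoder for Summary
val df = Seq((1L, "james", 9.5, "extra"), (2L, "tony", 7.0, "extra"))
  .toDF("id", "name", "score", "unused")
// Pick out only the fields the case class needs, by position or by name
val ds = df.map(row => Summary(row.getLong(0), row.getString(1), row.getAs[Double]("score")))
ds.show()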
Solution 3:
scala> import spark.implicits._
scala> val df = Seq((1, "james"), (2, "tony")).toDF("id", "name")
df: org.apache.spark.sql.DataFrame = [id: int, name: string]
scala> case class Student(id: Int, name: String)
defined class Student
scala> df.as[Student].collectAsList
res6: java.util.List[Student] = [Student(1,james), Student(2,tony)]
Here the spark in spark.implicits._ is your SparkSession. If you are inside the REPL the session is already defined as spark; otherwise you need to adjust the name accordingly to correspond to your SparkSession.
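Outside the REPL that looks roughly like the sketch below; the application name and local master are placeholder assumptions.
import org.apache.spark.sql.SparkSession
// Case class defined at top level so Spark can derive an Encoder for it
case class Student(id: Int, name: String)
object RowToCaseClassApp {
  def main(args: Array[String]): Unit = {
    // In spark-shell this session already exists under the name `spark`
    val spark = SparkSession.builder()
      .appName("row-to-case-class")   // placeholder name
      .master("local[*]")             // assumption: running locally
      .getOrCreate()
    import spark.implicits._          // the import is on your own session's name
    val ds = Seq((1, "james"), (2, "tony")).toDF("id", "name").as[Student]
    ds.show()
    spark.stop()
  }
}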
Solution 4:
Of course you can match a Row object into a case class. Let's suppose your schema has many fields and you want to match only a few of them into your case class. If you have no null fields you can simply do:
case class MyClass(a: Long, b: String, c: Int, d: String, e: String)
// requires import spark.implicits._ (for the Encoder[MyClass]) and org.apache.spark.sql.Row
dataframe.map {
  case Row(a: java.math.BigDecimal,
           b: String,
           c: Int,
           _: String,
           d: java.sql.Date,
           e: java.sql.Date,
           _: java.sql.Timestamp,
           _: java.sql.Timestamp,
           _: java.math.BigDecimal,
           _: String) => MyClass(a = a.longValue(), b = b, c = c, d = d.toString, e = e.toString)
}
This approach will fail in case of null values and also requires you to explicitly define the type of each field. If you have to handle null values, you need to either discard all the rows containing nulls by doing
dataframe.na.drop()
That will drop records even if the null fields are not the ones used in the pattern matching for your case class. Or, if you want to handle them, you can turn the Row object into a List and then use the Option pattern:
case class MyClass(a: Long, b: String, c: Option[Int], d: String, e: String)
// requires import spark.implicits._ for the Encoder[MyClass]
dataframe.map(_.toSeq.toList match {
  case List(a: java.math.BigDecimal,
            b: String,
            c: Int,
            _: String,
            d: java.sql.Date,
            e: java.sql.Date,
            _: java.sql.Timestamp,
            _: java.sql.Timestamp,
            _: java.math.BigDecimal,
            _: String) => MyClass(
    a = a.longValue(), b = b, c = Option(c), d = d.toString, e = e.toString)
})
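If you prefer not to rely on positional pattern matching at all, here is a minimal alternative sketch that reads fields by name with Row.getAs and checks the nullable column with isNullAt; the column names are illustrative assumptions, and dataframe and MyClass are the same as above.
// requires import spark.implicits._ for the Encoder[MyClass]
val ds = dataframe.map { row =>
  val cIdx = row.fieldIndex("c")   // index of the nullable column
  MyClass(
    a = row.getAs[java.math.BigDecimal]("a").longValue(),
    b = row.getAs[String]("b"),
    c = if (row.isNullAt(cIdx)) None else Some(row.getInt(cIdx)),
    d = row.getAs[java.sql.Date]("d").toString,
    e = row.getAs[java.sql.Date]("e").toString
  )
}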
Check out the GitHub project Sparkz, which will soon introduce a lot of libraries for simplifying the Spark and DataFrame APIs and making them more functional-programming oriented.