Generate a Spark StructType / Schema from a case class
If I wanted to create a StructType (i.e. a DataFrame.schema) out of a case class, is there a way to do it without creating a DataFrame? I can easily do:
case class TestCase(id: Long)
val schema = Seq[TestCase]().toDF.schema
But it seems overkill to actually create a DataFrame when all I want is the schema.
(If you are curious, the reason behind the question is that I am defining a UserDefinedAggregateFunction, and to do so you override a couple of methods that return StructTypes, and I use case classes.)
You can do it the same way SQLContext.createDataFrame does it:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType  // needed for the cast below
val schema = ScalaReflection.schemaFor[TestCase].dataType.asInstanceOf[StructType]
I know this question is almost a year old, but I came across it and thought others who find it might want to know about an approach I have just learned:
import org.apache.spark.sql.Encoders
val mySchema = Encoders.product[MyCaseClass].schema
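As a slightly fuller sketch of the Encoders approach above: the case class Reading and the object SchemaSketch are hypothetical names used only for illustration, and the snippet assumes Spark SQL is on the classpath. No SparkSession or DataFrame is created.

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

// Hypothetical case class, defined at the top level so that
// Spark can derive an encoder for it.
case class Reading(id: Long, label: String)

object SchemaSketch {
  // Derive the schema from the case class alone.
  val derived: StructType = Encoders.product[Reading].schema

  def main(args: Array[String]): Unit = {
    // Print the field names, types and nullability as a tree.
    derived.printTreeString()
  }
}
```

A StructType obtained this way could then be returned from the schema methods of a UserDefinedAggregateFunction, which is the use case described in the question.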
In case someone wants to do this for a custom Java bean:
ExpressionEncoder.javaBean(Event.class).schema().json()
Instead of manually reproducing the logic for creating the implicit Encoder object that gets passed to toDF, one can use that directly (or, more precisely, implicitly in the same way as toDF):
// spark: SparkSession
import spark.implicits._
implicitly[Encoder[MyCaseClass]].schema
Unfortunately, this actually suffers from the same problem as using org.apache.spark.sql.catalyst or Encoders as in the other answers: the Encoder trait is experimental.
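One way to package the implicit lookup is a small helper with a context bound. This is only a sketch: schemaOf and ImplicitSchema are hypothetical names, MyCaseClass is given a placeholder field, and a local SparkSession is assumed so that spark.implicits._ can supply the Encoder.

```scala
import org.apache.spark.sql.{Encoder, SparkSession}
import org.apache.spark.sql.types.StructType

// Placeholder definition; the real MyCaseClass would have its own fields.
case class MyCaseClass(id: Long)

object ImplicitSchema {
  // Hypothetical helper: the context bound asks the compiler for an
  // implicit Encoder[T], and implicitly retrieves it.
  def schemaOf[T: Encoder]: StructType = implicitly[Encoder[T]].schema

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("schema-sketch")
      .getOrCreate()
    import spark.implicits._ // brings the case class encoder into scope

    println(schemaOf[MyCaseClass].treeString)
    spark.stop()
  }
}
```

Unlike Encoders.product, this variant still needs a SparkSession, but only for the import; no DataFrame is built.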
How does this work? The toDF method on Seq comes from a DatasetHolder, which is created via the implicit localSeqToDatasetHolder that is imported via spark.implicits._. That function is defined like:
implicit def localSeqToDatasetHolder[T](s: Seq[T])(implicit arg0: Encoder[T]): DatasetHolder[T]
As you can see, it takes an implicit Encoder[T] argument, which, for a case class, can be computed via newProductEncoder (also imported via spark.implicits._). We can reproduce this implicit logic to get an Encoder for our case class via the convenience method scala.Predef.implicitly (in scope by default, because it comes from Predef), which simply returns its requested implicit argument:
def implicitly[T](implicit e: T): T
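The mechanism can be seen in isolation with plain Scala and no Spark at all. In this sketch Enc, describe and ImplicitlySketch are hypothetical stand-ins (Enc for Encoder, describe for a function shaped like localSeqToDatasetHolder); they are not part of Spark.

```scala
// Hypothetical stand-in for Spark's Encoder: it only carries a schema description.
trait Enc[T] { def schema: String }

object Enc {
  // Implicit instance in the companion object, playing the role that
  // newProductEncoder plays for case classes.
  implicit val longEnc: Enc[Long] = new Enc[Long] { val schema = "long" }
}

object ImplicitlySketch {
  // Same shape as localSeqToDatasetHolder: the compiler supplies the Enc[T].
  def describe[T](s: Seq[T])(implicit enc: Enc[T]): String = enc.schema

  // Going through an (empty) Seq, as Seq[TestCase]().toDF.schema does.
  val viaSeq: String = describe(Seq.empty[Long])

  // Asking for the same implicit instance directly, with no Seq involved.
  val direct: String = implicitly[Enc[Long]].schema
}
```

Both values come from the same implicit instance, which is why implicitly[Encoder[MyCaseClass]].schema yields the same schema as the toDF route without constructing any data.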