Read Parquet with binary (protobuf) columns

I have a Parquet file in which each column is a different serialized protocol buffer. When I run

Dataset<Row> df = spark.read().parquet("test.parquet"); 
df.printSchema();

I got

root
   |-- A: binary (nullable = true)
   |-- B: binary (nullable = true)

But I would like to see

root
   |-- A: struct
        |-- a: string
        |-- b: int
   |-- B: struct
        |-- c: int
        |-- d: string

Is there any way I can read the Parquet file and deserialize each binary column into a struct? I was thinking of using Apache Parquet Proto, but it doesn't fit my case, because it assumes the columns of the Parquet file are already in struct format, not binary.


Solution 1:

There must be a simpler approach, but here is what I have in mind.

Perhaps you can define Scala case classes for your protobuf model; if your model is more complex, you can use something like ScalaPB to generate Scala case classes from your .proto files.

case class A(a: String, b: Int)
case class B(c: Int, d: String)

Write a UDF that converts the binary value into the appropriate model:

import org.apache.spark.sql.functions.udf
import spark.implicits._ // for the $"..." column syntax used below

val toA = udf { (values: Array[Byte]) =>
  val aRow = ModelA.parseFrom(values) // ModelA is the generated protobuf class for column A
  A(aRow.a, aRow.b)
}
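
The second column would follow the same pattern; a sketch, assuming ModelB is the generated protobuf class for column B:

val toB = udf { (values: Array[Byte]) =>
  val bRow = ModelB.parseFrom(values) // ModelB assumed to be the generated protobuf class for B
  B(bRow.c, bRow.d)
}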

Then you can parse a binary column with the corresponding UDF:

df.select(toA($"A")).show()
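
To get back to the struct schema from the question, you could also replace both columns in place; a rough sketch using the toA/toB UDFs above (Spark encodes the returned case classes as struct columns):

val parsed = df
  .withColumn("A", toA($"A"))
  .withColumn("B", toB($"B"))

parsed.printSchema() // A and B should now appear as struct columns with the nested fields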

I haven't tested this code, but it should work; let me know if you encounter any problems.