Reading JSON with Apache Spark - `corrupt_record`

Since Spark expects JSON Lines format rather than a typical JSON file, we can tell Spark to read typical (multi-line) JSON by specifying:

val df = spark.read.option("multiline", "true").json("<file>")
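For instance, a pretty-printed file such as the one sketched below (the file name points.json and its contents are hypothetical, modelled on the sample further down) spans several lines per object, and the multiline option lets Spark parse it as a whole:

// points.json (hypothetical file) -- a "typical", pretty-printed JSON document:
// {
//   "toid": "osgb4000000031043205",
//   "point": [508180.748, 195333.973],
//   "index": 1
// }

val typical = spark.read
  .option("multiline", "true")
  .json("points.json")

typical.printSchema()   // roughly: index: long, point: array<double>, toid: string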

Spark cannot read a JSON array into records at the top level, so you have to pass one JSON object per line:

{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1} 
{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2} 
{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3} 
{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}

As described in the tutorial you're referring to:

Let's begin by loading a JSON file, where each line is a JSON object

The reasoning is quite simple. Spark expects a file with many JSON entities (one entity per line), so it can distribute their processing (per entity, roughly speaking).

To shed more light on it, here is a quote from the official docs:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

This format is called JSONL (JSON Lines). Basically, it's an alternative to CSV.
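The parallel is easy to see on the write side as well: df.write.json emits exactly this format, one JSON object per line in each part file (a minimal sketch, with a hypothetical output path):

// Produces part-*.json files containing JSON Lines, one object per line
// ("/tmp/points-jsonl" is a hypothetical output path)
df.write.json("/tmp/points-jsonl")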


To read the multi-line JSON as a DataFrame:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// wholeTextFiles yields (path, content) pairs, so each file is parsed as one JSON string
val df = spark.read.json(spark.sparkContext.wholeTextFiles("file.json").values)

Reading large files in this manner is not recommended; from the wholeTextFiles docs:

Small files are preferred, large file is also allowable, but may cause bad performance.
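As a side note, the RDD[String] overload of json used above has been deprecated since Spark 2.2 in favour of json(Dataset[String]); a minimal sketch of the same idea (the same caveat about large files still applies):

import spark.implicits._   // enables .toDS() on the RDD[String]

// Same whole-file read, fed through the Dataset[String] overload instead
val jsonDS = spark.sparkContext.wholeTextFiles("file.json").values.toDS()
val df = spark.read.json(jsonDS)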