Reading JSON with Apache Spark - `corrupt_record`

Since Spark expects JSON Lines format rather than a typical JSON file, we can tell Spark to read typical (multi-line) JSON by specifying:

val df = spark.read.option("multiline", "true").json("<file>")
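For instance, a pretty-printed file such as the one sketched below (the file name points.json and its contents are hypothetical, modelled on the sample further down) spans several lines per object, and the multiline option lets Spark parse it as a whole:

// points.json (hypothetical file) -- a "typical", pretty-printed JSON document:
// {
//   "toid": "osgb4000000031043205",
//   "point": [508180.748, 195333.973],
//   "index": 1
// }

val typical = spark.read
  .option("multiline", "true")
  .json("points.json")

typical.printSchema()   // roughly: index: long, point: array<double>, toid: string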

Spark cannot read a JSON array into records at the top level, so you have to pass one JSON object per line:

{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1} 
{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2} 
{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3} 
{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}

As described in the tutorial you're referring to:

Let's begin by loading a JSON file, where each line is a JSON object

The reasoning is quite simple. Spark expects a file with many JSON entities (one entity per line), so it can distribute their processing (per entity, roughly speaking).

To shed more light on it, here is a quote from the official docs:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

This format is called JSONL (JSON Lines). Basically, it's an alternative to CSV.
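The parallel is easy to see on the write side as well: df.write.json emits exactly this format, one JSON object per line in each part file (a minimal sketch, with a hypothetical output path):

// Produces part-*.json files containing JSON Lines, one object per line
// ("/tmp/points-jsonl" is a hypothetical output path)
df.write.json("/tmp/points-jsonl")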


To read the multi-line JSON as a DataFrame:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// wholeTextFiles yields (path, content) pairs, so each file is parsed as one JSON string
val df = spark.read.json(spark.sparkContext.wholeTextFiles("file.json").values)

Reading large files in this manner is not recommended; from the wholeTextFiles docs:

Small files are preferred, large file is also allowable, but may cause bad performance.
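As a side note, the RDD[String] overload of json used above has been deprecated since Spark 2.2 in favour of json(Dataset[String]); a minimal sketch of the same idea (the same caveat about large files still applies):

import spark.implicits._   // enables .toDS() on the RDD[String]

// Same whole-file read, fed through the Dataset[String] overload instead
val jsonDS = spark.sparkContext.wholeTextFiles("file.json").values.toDS()
val df = spark.read.json(jsonDS)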