_corrupt_record error when reading a JSON file into Spark

I've got this JSON file

{
    "a": 1, 
    "b": 2
}

which was produced with Python's json.dump method. Now I want to read this file into a DataFrame in Spark, using pyspark. Following the documentation, I'm doing this:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

df = sqlc.read.json('my_file.json')
df.show()

The show() call spits out this, though:

+---------------+
|_corrupt_record|
+---------------+
|              {|
|       "a": 1, |
|         "b": 2|
|              }|
+---------------+

Does anyone know what's going on and why the file is not being interpreted correctly?


If you want to leave your JSON file as it is (without stripping the newline characters \n), include the multiLine=True keyword argument (available in Spark 2.2+):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

df = sqlc.read.json('my_file.json', multiLine=True)
df.show()

You need to have one JSON object per line in your input file; see http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json

If your JSON file looks like this, it will give you the expected DataFrame:

{ "a": 1, "b": 2 }
{ "a": 3, "b": 4 }

....
df.show()
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
+---+---+
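
For completeness, a minimal end-to-end sketch (assuming the two lines above are saved as my_file.json, and reusing the SQLContext setup from the question):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

# with one JSON object per line, no extra options are needed
df = sqlc.read.json('my_file.json')
df.show()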

In Spark 2.2+ you can read a multiline JSON file using the following command:

val dataframe = spark.read.option("multiline", true).json("filePath")

If there is one JSON object per line, then:

val dataframe = spark.read.json("filePath")
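
Since the question uses pyspark, the Python equivalents would look roughly like this (a sketch assuming a SparkSession named spark, as in the Scala snippets above, with "filePath" as a placeholder path):

# multiline JSON (Spark 2.2+)
df = spark.read.option("multiline", True).json("filePath")

# one JSON object per line
df = spark.read.json("filePath")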

Adding to @Bernhard's great answer

import json

# original file was written with pretty-print inside a list
with open("pretty-printed.json") as jsonfile:
    js = json.load(jsonfile)

# write a new file with one object per line
with open("flattened.json", 'a') as outfile:
    for d in js:
        json.dump(d, outfile)
        outfile.write('\n')
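
The flattened file can then be read without the multiLine option, e.g. (a sketch reusing the sqlc reader from the question):

df = sqlc.read.json('flattened.json')
df.show()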