How to parse huge JSON file as stream in Json.NET?

I have a very, very large JSON file (1000+ MB) of identical JSON objects. For example:

[
    {
        "id": 1,
        "value": "hello",
        "another_value": "world",
        "value_obj": {
            "name": "obj1"
        },
        "value_list": [
            1,
            2,
            3
        ]
    },
    {
        "id": 2,
        "value": "foo",
        "another_value": "bar",
        "value_obj": {
            "name": "obj2"
        },
        "value_list": [
            4,
            5,
            6
        ]
    },
    {
        "id": 3,
        "value": "a",
        "another_value": "b",
        "value_obj": {
            "name": "obj3"
        },
        "value_list": [
            7,
            8,
            9
        ]
    },
    ...
]

Every single item in the root JSON list follows the same structure and thus would be individually deserializable. I already have the C# classes written to receive this data, and deserializing a JSON file containing a single object without the list works as expected.
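
For illustration, the receiving classes are shaped roughly like this. This is only a simplified sketch (my actual classes aren't shown here); the [JsonProperty] attributes map the snake_case JSON keys onto C# property names, and MyNestedObject is just an example name for the nested value_obj type.

using System.Collections.Generic;
using Newtonsoft.Json;

// Illustrative classes matching the sample JSON; the real MyObject may differ.
public class MyObject
{
    [JsonProperty("id")]
    public int Id { get; set; }

    [JsonProperty("value")]
    public string Value { get; set; }

    [JsonProperty("another_value")]
    public string AnotherValue { get; set; }

    [JsonProperty("value_obj")]
    public MyNestedObject ValueObj { get; set; }

    [JsonProperty("value_list")]
    public List<int> ValueList { get; set; }
}

public class MyNestedObject
{
    [JsonProperty("name")]
    public string Name { get; set; }
}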

At first, I tried to just directly deserialize my objects in a loop:

JsonSerializer serializer = new JsonSerializer();
MyObject o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (!sr.EndOfStream)
    {
        o = serializer.Deserialize<MyObject>(reader);
    }
}

This didn't work; it threw an exception clearly stating that an object was expected, not a list. My understanding is that this call reads a single object at the root level of the JSON file, but since the root here is a list of objects, the request is invalid.

My next idea was to deserialize as a C# List of objects:

JsonSerializer serializer = new JsonSerializer();
List<MyObject> o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (!sr.EndOfStream)
    {
        o = serializer.Deserialize<List<MyObject>>(reader);
    }
}

This does succeed, but it only somewhat reduces the high RAM usage. It does look like the application deserializes items one at a time rather than buffering the entire JSON text, but we still end up with a lot of RAM usage because the C# List now holds all of the data from the file. The problem has only been displaced.

I then decided to simply try taking a single character off the beginning of the stream (to eliminate the [) by calling sr.Read() before entering the loop. The first object then reads successfully, but subsequent ones do not, failing with an "unexpected token" exception. My guess is that the comma and whitespace between the objects are throwing the reader off.

Simply stripping out the square brackets won't work, since the objects contain primitive lists of their own, as you can see in the sample. Nor will using }, as a separator, since there are sub-objects nested within the objects.

My goal is to read the objects from the stream one at a time: read an object, do something with it, discard it from RAM, then read the next one, and so on. This would eliminate the need to hold either the entire JSON string or the entire set of deserialized C# objects in RAM at once.

What am I missing?


Solution 1:

This should resolve your problem. It works much like your initial code, except that it only deserializes an object when the reader reaches a start-object token ({) in the stream; otherwise it just keeps advancing until it finds the next one.

JsonSerializer serializer = new JsonSerializer();
MyObject o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (reader.Read())
    {
        // Deserialize only when the reader is at the start of an object ('{')
        if (reader.TokenType == JsonToken.StartObject)
        {
            o = serializer.Deserialize<MyObject>(reader);
        }
    }
}
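
If you want to hand each object to the calling code as it is read (and let it be collected afterwards), the same idea can be wrapped in an iterator. This is only a sketch built on the code above; BigFileReader and ReadObjects are hypothetical names.

using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public static class BigFileReader   // hypothetical helper class
{
    public static IEnumerable<MyObject> ReadObjects(string path)
    {
        JsonSerializer serializer = new JsonSerializer();
        using (FileStream s = File.Open(path, FileMode.Open))
        using (StreamReader sr = new StreamReader(s))
        using (JsonReader reader = new JsonTextReader(sr))
        {
            while (reader.Read())
            {
                // Deserialize consumes the whole object, including its nested
                // value_obj and value_list, so the loop resumes at the next
                // start-object token.
                if (reader.TokenType == JsonToken.StartObject)
                {
                    yield return serializer.Deserialize<MyObject>(reader);
                }
            }
        }
    }
}

Used as foreach (MyObject o in BigFileReader.ReadObjects("bigfile.json")) { ... }, only one MyObject needs to be alive at a time.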

Solution 2:

I think we can do better than the accepted answer, using more features of JsonReader to make a more generalized solution.

As a JsonReader consumes tokens from a JSON document, the current path is recorded in the JsonReader.Path property.

We can use this to precisely select deeply nested data from a JSON file, using a regular expression to ensure that we're on the right path.
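
For example, with the sample document from the question, Path takes values like these as the reader advances (illustrative, following Json.NET's path syntax):

[0]
[0].id
[0].value_obj
[0].value_obj.name
[0].value_list[0]
[1]
[1].value
...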

So, using the following extension method:

using System.Collections.Generic;
using System.Text.RegularExpressions;
using Newtonsoft.Json;

public static class JsonReaderExtensions
{
    public static IEnumerable<T> SelectTokensWithRegex<T>(
        this JsonReader jsonReader, Regex regex)
    {
        JsonSerializer serializer = new JsonSerializer();
        while (jsonReader.Read())
        {
            if (regex.IsMatch(jsonReader.Path) 
                && jsonReader.TokenType != JsonToken.PropertyName)
            {
                yield return serializer.Deserialize<T>(jsonReader);
            }
        }
    }
}

The data you are concerned with lies on paths:

[0]
[1]
[2]
... etc

We can construct the following regex to precisely match these paths:

var regex = new Regex(@"^\[\d+\]$");

It now becomes possible to stream objects out of your data (without loading or materializing the entire JSON in memory) as follows:

IEnumerable<MyObject> objects = jsonReader.SelectTokensWithRegex<MyObject>(regex);
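
Putting it together with the file from the question (a sketch; the reader is built the same way as in the earlier snippets):

using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader jsonReader = new JsonTextReader(sr))
{
    var regex = new Regex(@"^\[\d+\]$");
    foreach (MyObject o in jsonReader.SelectTokensWithRegex<MyObject>(regex))
    {
        // Process o here; because the extension method is an iterator,
        // only the current object is held in memory.
    }
}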

Or, if we want to dig even deeper into the structure, we can be even more precise with our regex:

var regex = new Regex(@"^\[\d+\]\.value$");
IEnumerable<string> objects = jsonReader.SelectTokensWithRegex<string>(regex);

to only extract value properties from the items in the array.

I've found this technique extremely useful for extracting specific data from huge (100 GiB) JSON dumps, read directly from an HTTP network stream, with low memory requirements and no intermediate storage needed.
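
As a rough illustration of the HTTP case (the URL is a placeholder, and this assumes an async context), the reader can be pointed straight at the response stream:

using System.Net.Http;
using System.Text.RegularExpressions;
using Newtonsoft.Json;

using (var client = new HttpClient())
// ResponseHeadersRead keeps HttpClient from buffering the whole body.
using (var response = await client.GetAsync(
    "https://example.com/huge-dump.json", HttpCompletionOption.ResponseHeadersRead))
using (var body = await response.Content.ReadAsStreamAsync())
using (var sr = new StreamReader(body))
using (JsonReader jsonReader = new JsonTextReader(sr))
{
    var regex = new Regex(@"^\[\d+\]$");
    foreach (MyObject o in jsonReader.SelectTokensWithRegex<MyObject>(regex))
    {
        // Each object is processed and discarded as the stream is read;
        // the full dump never needs to fit in memory or on disk.
    }
}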