Is there a memory efficient and fast way to load big JSON files?

There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.

Update:

I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:

import ijson
for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print prefix, the_type, value

where prefix is a dot-separated index in the JSON tree (what happens if your key names have dots in them? I guess that would be bad for Javascript, too...), theType describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object or None if the_type is an event like starting/ending a map/array.

The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.


So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:

  1. Modularize your code. Do something like:

    for json_file in list_of_files:
        process_file(json_file)
    

    If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.

  2. Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another python process that calls your script via subprocess.Popen. This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.

Hope this helps.


Yes.

You can use jsonstreamer SAX-like push parser that I have written which will allow you to parse arbitrary sized chunks, you can get it here and checkout the README for examples. Its fast because it uses the 'C' yajl library.


It can be done by using ijson. The working of ijson has been very well explained by Jim Pivarski in the answer above. The code below will read a file and print each json from the list. For example, file content is as below

[{"name": "rantidine",  "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip",  "drug": {"type": "capsule", "content_type": "solid"}}]

You can print every element of the array using the below method

 def extract_json(filename):
      with open(filename, 'rb') as input_file:
          jsonobj = ijson.items(input_file, 'item')
          jsons = (o for o in jsonobj)
          for j in jsons:
             print(j)

Note: 'item' is the default prefix given by ijson.

if you want to access only specific json's based on a condition you can do it in following way.

def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        objects = ijson.items(input_file, 'item.drugs')
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)

This will print only those json whose type is tablet.


On your mention of running out of memory I must question if you're actually managing memory. Are you using the "del" keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.