How can I lazily read multiple JSON values from a file/stream in Python?
I'd like to read multiple JSON objects from a file/stream in Python, one at a time. Unfortunately json.load() just .read()s until end-of-file; there doesn't seem to be any way to use it to read a single object or to lazily iterate over the objects.
Is there any way to do this? Using the standard library would be ideal, but if there's a third-party library I'd use that instead.
At the moment I'm putting each object on a separate line and using json.loads(f.readline()), but I would really prefer not to need to do this.
Example Use
example.py
import my_json as json
import sys
for o in json.iterload(sys.stdin):
    print("Working on a", type(o))
in.txt
{"foo": ["bar", "baz"]} 1 2 [] 4 5 6
example session
$ python3.2 example.py < in.txt
Working on a dict
Working on a int
Working on a int
Working on a list
Working on a int
Working on a int
Working on a int
JSON generally isn't very good for this sort of incremental use; there's no standard way to serialise multiple objects so that they can easily be loaded one at a time, without parsing the whole lot.
The object per line solution that you're using is seen elsewhere too. Scrapy calls it 'JSON lines':
- https://docs.scrapy.org/en/latest/topics/exporters.html?highlight=exporters#jsonitemexporter
- http://www.enricozini.org/2011/tips/python-stream-json/
You can do it slightly more Pythonically:
for jsonline in f:
    yield json.loads(jsonline)  # or do the processing in this loop
I think this is about the best way: it doesn't rely on any third-party libraries, and it's easy to understand what's going on. I've used it in some of my own code as well.
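For reference, here is a slightly fuller sketch of the line-per-object approach. It assumes every non-blank line holds exactly one complete JSON value and that the stream is opened in text mode. The name iter_json_lines is made up for illustration, and skipping blank lines is an added convenience rather than something the answer above requires.

```python
import io
import json


def iter_json_lines(fp):
    """Yield one decoded value per non-blank line of a text stream.

    Hypothetical helper: assumes one complete JSON value per line.
    """
    for line in fp:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# io.StringIO stands in for an open file or sys.stdin.
stream = io.StringIO('{"foo": ["bar", "baz"]}\n1\n\n[]\n')
for value in iter_json_lines(stream):
    print("Working on a", type(value).__name__)
```

Because the file object is iterated line by line, only one line needs to be held in memory at a time.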
A little late maybe, but I had this exact problem (well, more or less). My standard solution for these problems is usually to just do a regex split on some well-known root object, but in my case it was impossible. The only feasible way to do this generically is to implement a proper tokenizer.
After not finding a generic-enough and reasonably well-performing solution, I ended up doing this myself, writing the splitstream module. It is a pre-tokenizer that understands JSON and XML and splits a continuous stream into multiple chunks for parsing (it leaves the actual parsing up to you though). To get some kind of performance out of it, it is written as a C module.
Example:
import json
import sys
from splitstream import splitfile
for jsonstr in splitfile(sys.stdin, format="json"):
    yield json.loads(jsonstr)
Sure you can do this. You just have to call raw_decode directly. This implementation loads the whole file into memory and operates on that string (much as json.load does); if you have large files you can modify it to only read from the file as necessary without much difficulty.
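import json
from json.decoder import WHITESPACE

def iterload(string_or_fp, cls=json.JSONDecoder, **kwargs):
    # Accept either a file-like object or a string. Checking for .read()
    # works on Python 2 and 3; the built-in `file` type is Python 2 only.
    if hasattr(string_or_fp, "read"):
        string = string_or_fp.read()
    else:
        string = str(string_or_fp)

    decoder = cls(**kwargs)
    # raw_decode does not skip leading whitespace, so do it by hand.
    idx = WHITESPACE.match(string, 0).end()
    while idx < len(string):
        obj, end = decoder.raw_decode(string, idx)
        yield obj
        idx = WHITESPACE.match(string, end).end()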
Usage: just as you requested, it's a generator.
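The "only read from the file as necessary" variant mentioned above could look roughly like the sketch below. It is an illustration under some assumptions, not the answer author's code. The names iterload_chunked, chunk_size and _UNDECIDED_TAILS are invented here. The stream is assumed to be in text mode, and json.JSONDecodeError requires Python 3.5 or later. The idea is to keep a buffer, try raw_decode on it, and read another chunk whenever decoding fails or the value might still continue.

```python
import io
import json

# Text left over after a decoded value that could still turn out to be the
# middle of a number once more input arrives (e.g. "1" + ".5", "1e" + "+3").
_UNDECIDED_TAILS = {"", ".", "e", "E", "e+", "e-", "E+", "E-"}


def iterload_chunked(fp, chunk_size=65536, cls=json.JSONDecoder, **kwargs):
    """Yield JSON values from a text stream, reading chunk_size characters at a time.

    Hypothetical helper, sketched for illustration only.
    """
    decoder = cls(**kwargs)
    buf = ""
    eof = False
    while not eof:
        chunk = fp.read(chunk_size)
        eof = not chunk
        buf += chunk
        while True:
            buf = buf.lstrip(" \t\n\r")  # JSON whitespace only
            if not buf:
                break
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                if eof:
                    raise  # malformed or truncated input
                break  # probably an incomplete value: read more and retry
            if not eof and buf[end:] in _UNDECIDED_TAILS:
                break  # the value may continue in the next chunk
            yield obj
            buf = buf[end:]


# Same sample input as in the question; io.StringIO stands in for a file.
sample = io.StringIO('{"foo": ["bar", "baz"]} 1 2 [] 4 5 6')
for value in iterload_chunked(sample, chunk_size=4):
    print("Working on a", type(value).__name__)
```

Two trade-offs in this sketch: a value that spans many chunks is re-parsed from its start after each read, and genuinely malformed input is only reported once the rest of the stream has been read.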
This is a pretty nasty problem actually, because you have to stream in lines but pattern-match across multiple lines against braces, and also pattern-match JSON. It's a sort of JSON pre-parse followed by a JSON parse. JSON is, in comparison to other formats, easy to parse, so it's not always necessary to go for a parsing library. Nevertheless, how should we solve these conflicting issues?
Generators to the rescue!
The beauty of generators for a problem like this is you can stack them on top of each other gradually abstracting away the difficulty of the problem whilst maintaining laziness. I also considered using the mechanism for passing back values into a generator (send()) but fortunately found I didn't need to use that.
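As a toy illustration of that stacking idea (separate from the actual solution that follows), here is a three-stage pipeline in which each stage pulls from the one beneath it only when asked. The stage names traced, nonblank and decoded are made up for this sketch, and the log list exists only to show how many lines have been read so far.

```python
import io
import json


def traced(stream, log):
    """Stage 1: pass lines through, recording each one as it is pulled."""
    for line in stream:
        log.append(line)
        yield line


def nonblank(lines):
    """Stage 2: drop lines that contain only whitespace."""
    for line in lines:
        if line.strip():
            yield line


def decoded(lines):
    """Stage 3: parse each remaining line as one JSON value."""
    for line in lines:
        yield json.loads(line)


log = []
source = io.StringIO('1\n\n[2]\n{"a": 3}\n')  # stands in for a real file
pipeline = decoded(nonblank(traced(source, log)))

first = next(pipeline)
print(first, len(log))  # only as many lines as needed have been read
```

Asking for one value reads a single line from the source; the rest stay unread until the next value is requested.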
To solve the first of the problems you need some sort of streamingfinditer, as a streaming version of re.finditer. My attempt at this below pulls in lines as needed (uncomment the debug statement to see) whilst still returning matches. I actually then modified it slightly to yield non-matched lines as well as matches (marked as 0 or 1 in the first part of the yielded tuple).
import re
def streamingfinditer(pat,stream):
    for s in stream:
        # print("Read next line: " + s)
        while 1:
            m = re.search(pat,s)
            if not m:
                yield (0,s)
                break
            yield (1,m.group())
            s = re.split(pat,s,maxsplit=1)[1]
With that, it's then possible to match up until braces, account each time for whether the braces are balanced, and then return either simple or compound objects as appropriate.
braces='{}[]'
whitespaceesc=' \t'
bracesesc='\\'+'\\'.join(braces)
balancemap=dict(zip(braces,[1,-1,1,-1]))
bracespat='['+bracesesc+']'
nobracespat='[^'+bracesesc+']*'
untilbracespat=nobracespat+bracespat
def simpleorcompoundobjects(stream):
    obj = ""
    unbalanced = 0
    for (c,m) in streamingfinditer(re.compile(untilbracespat),stream):
        if (c == 0): # remainder of line returned, nothing interesting
            if (unbalanced == 0):
                yield (0,m)
            else:
                obj += m
        if (c == 1): # match returned
            if (unbalanced == 0):
                yield (0,m[:-1])
                obj += m[-1]
            else:
                obj += m
            unbalanced += balancemap[m[-1]]
            if (unbalanced == 0):
                yield (1,obj)
                obj=""
This returns tuples as follows:
(0,"String of simple non-braced objects easy to parse")
(1,"{ 'Compound' : 'objects' }")
Basically that's the nasty part done. We now just have to do the final level of parsing as we see fit. For example we can use Jeremy Roman's iterload function (Thanks!) to do parsing for a single line:
def streamingiterload(stream):
    for c,o in simpleorcompoundobjects(stream):
        for x in iterload(o):
            yield x
Test it:
of = open("test.json","w")
of.write("""[ "hello" ] { "goodbye" : 1 } 1 2 {
} 2
9 78
4 5 { "animals" : [ "dog" , "lots of mice" ,
"cat" ] }
""")
of.close()
# open & stream the json
f = open("test.json","r")
for o in streamingiterload(f.readlines()):
    print(o)
f.close()
I get these results (and if you turn on that debug line, you'll see it pulls in the lines as needed):
[u'hello']
{u'goodbye': 1}
1
2
{}
2
9
78
4
5
{u'animals': [u'dog', u'lots of mice', u'cat']}
This won't work for all situations. Due to the implementation of the json library, it is impossible to work entirely correctly without reimplementing the parser yourself.
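One kind of input that brace counting alone is likely to mishandle is a brace inside a string literal. This may not be the limitation the author had in mind, so treat it only as an illustration. The sketch below uses a deliberately naive, self-contained splitter (split_by_brace_depth is a made-up name, not the code above) to show how the count gets fooled.

```python
import json


def split_by_brace_depth(text):
    """Naively cut text wherever the running brace/bracket depth returns to zero.

    Hypothetical helper: it ignores string literals on purpose, to show the problem.
    """
    depth = 0
    start = 0
    for i, ch in enumerate(text):
        if ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
            if depth == 0:
                yield text[start:i + 1]
                start = i + 1


# The "}" inside the string value closes the object too early for the counter.
chunks = list(split_by_brace_depth('{"a": "}"}'))
print(chunks)

try:
    json.loads(chunks[0])
except json.JSONDecodeError as exc:
    print("decode failed:", exc.msg)
```

Handling this properly means tracking whether the scanner is inside a string (including escaped quotes), which is a step towards the real tokenizer mentioned in the splitstream answer.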