Memory Efficient Alternatives to Python Dictionaries
Some measurements. I took 10MB of free e-book text and computed trigram frequencies, producing a 24MB file. Storing it in different simple Python data structures took this much space in kB, measured as RSS from running ps, where d is a dict, keys and freqs are lists, and a,b,c,freq are the fields of a trigram record:
295760 S. Lott's answer
237984 S. Lott's with keys interned before passing in
203172 [*] d[(a,b,c)] = int(freq)
203156 d[a][b][c] = int(freq)
189132 keys.append((a,b,c)); freqs.append(int(freq))
146132 d[intern(a),intern(b)][intern(c)] = int(freq)
145408 d[intern(a)][intern(b)][intern(c)] = int(freq)
83888 [*] d[a+' '+b+' '+c] = int(freq)
82776 [*] d[(intern(a),intern(b),intern(c))] = int(freq)
68756 keys.append((intern(a),intern(b),intern(c))); freqs.append(int(freq))
60320 keys.append(a+' '+b+' '+c); freqs.append(int(freq))
50556 pair array
48320 squeezed pair array
33024 squeezed single array
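(For reference, a process's RSS can be read back with something like the following; the answer only says it ran ps, so this is just one way to do it, on Linux, reporting kB.)

import os
os.system('ps -o rss= -p %d' % os.getpid())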
The entries marked [*] have no efficient way to look up a pair (a,b); they're listed only because others have suggested them (or variants of them). (I was sort of irked into making this because the top-voted answers were not helpful, as the table shows.)
'Pair array' is the scheme below in my original answer ("I'd start with the array with keys being the first two words..."), where the value table for each pair is represented as a single string. 'Squeezed pair array' is the same, leaving out the frequency values that are equal to 1 (the most common case). 'Squeezed single array' is like squeezed pair array, but gloms key and value together as one string (with a separator character). The squeezed single array code:
import collections

def build(file):
    pairs = collections.defaultdict(list)
    for line in file:   # N.B. file assumed to be already sorted
        a, b, c, freq = line.split()
        key = ' '.join((a, b))
        pairs[key].append(c + ':' + freq if freq != '1' else c)
    out = open('squeezedsinglearrayfile', 'w')
    for key in sorted(pairs.keys()):
        out.write('%s|%s\n' % (key, ' '.join(pairs[key])))

def load():
    return open('squeezedsinglearrayfile').readlines()

if __name__ == '__main__':
    build(open('freqs'))
I haven't written the code to look up values from this structure (use bisect, as mentioned below), or implemented the fancier compressed structures also described below.
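A rough, untested sketch of what that bisect lookup could look like (the helper names here are made up, and the parallel keys list costs some memory beyond the table's numbers, so treat it as illustrative):

import bisect

def load_with_keys():
    lines = open('squeezedsinglearrayfile').readlines()
    # Parallel, sorted list of just the key parts for bisecting; sorted
    # because build() wrote the lines in sorted key order.
    keys = [line.split('|', 1)[0] for line in lines]
    return keys, lines

def lookup(keys, lines, a, b):
    key = a + ' ' + b
    i = bisect.bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return {}
    values = lines[i].rstrip('\n').split('|', 1)[1].split()
    # Each value is 'word:freq', or a bare 'word' when the freq was 1.
    return dict(v.split(':') if ':' in v else (v, '1') for v in values)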
Original answer: A simple sorted array of strings, each string being a space-separated concatenation of words, searched using the bisect module, should be worth trying for a start. This saves space on pointers, etc. It still wastes space due to the repetition of words; there's a standard trick to strip out common prefixes, with another level of index to get them back, but that's rather more complex and slower. (The idea is to store successive chunks of the array in a compressed form that must be scanned sequentially, along with a random-access index to each chunk. Chunks are big enough to compress, but small enough for reasonable access time. The particular compression scheme applicable here: if successive entries are 'hello george' and 'hello world', make the second entry be '6world' instead, 6 being the length of the prefix in common. Or maybe you could get away with using zlib? Anyway, you can find out more in this vein by looking up dictionary structures used in full-text search.)

So specifically, I'd start with the array with keys being the first two words, with a parallel array whose entries list the possible third words and their frequencies. It might still suck, though -- I think you may be out of luck as far as batteries-included memory-efficient options go.
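To make the prefix trick concrete, here's a minimal sketch of that front-coding idea (my own illustration; it assumes entries never begin with a digit, since leading digits are reserved for the shared-prefix length):

def frontcode(entries):
    # entries must already be sorted; each entry after the first is stored
    # as the length of the prefix shared with its predecessor, then the tail.
    prev = ''
    for e in entries:
        n = 0
        while n < min(len(prev), len(e)) and prev[n] == e[n]:
            n += 1
        yield e if not prev else '%d%s' % (n, e[n:])
        prev = e

def frontdecode(coded):
    prev = ''
    for c in coded:
        i = 0
        while i < len(c) and c[i].isdigit():
            i += 1
        prev = prev[:int(c[:i] or 0)] + c[i:]
        yield prev

list(frontcode(['hello george', 'hello world'])) yields ['hello george', '6world'], matching the example above.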
Also, binary tree structures are not recommended for memory efficiency here. E.g., this paper tests a variety of data structures on a similar problem (unigrams instead of trigrams though) and finds a hashtable to beat all of the tree structures by that measure.
I should have mentioned, as someone else did, that the sorted array could be used just for the wordlist, not bigrams or trigrams; then for your 'real' data structure, whatever it is, you use integer keys instead of strings -- indices into the wordlist. (But this keeps you from exploiting common prefixes except in the wordlist itself. Maybe I shouldn't suggest this after all.)
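A minimal sketch of that wordlist idea (my illustration, with made-up names, assuming every looked-up word really is in the wordlist):

import bisect

words_seen = ['the', 'quick', 'fox', 'jumps']   # stand-in for a first pass over the data
wordlist = sorted(set(words_seen))

def word_id(w):
    # Bisect the sorted wordlist instead of keeping a word->index dict,
    # which would cost extra memory; assumes w is in the wordlist.
    i = bisect.bisect_left(wordlist, w)
    assert wordlist[i] == w
    return i

d = {}
d[word_id('the'), word_id('quick'), word_id('fox')] = 1   # integer-keyed trigram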
Use tuples.
Tuples can be keys to dictionaries, so you don't need to nest dictionaries.
d = {}
d[word1, word2, word3] = 1
Also, as a plus, you could use defaultdict:
- elements that don't have entries always return 0
- you can say d[w1, w2, w3] += 1 without checking whether the key already exists
Example:
from collections import defaultdict
d = defaultdict(int)
d["first","word","tuple"] += 1
If you need to find all words word3 that are tupled with a given (word1, word2), search for them in dictionary.keys() using a list comprehension.
If you have a tuple, you can get its first two items using a slice:
>>> a = (1,2,3)
>>> a[:2]
(1, 2)
A small example of searching tuples with a list comprehension:
>>> b = [(1,2,3),(1,2,5),(3,4,6)]
>>> search = (1,2)
>>> [a[2] for a in b if a[:2] == search]
[3, 5]
Here we get a list of all items that appear as the third item in tuples starting with (1, 2).
In this case, ZODB¹ BTrees might be helpful, since they are much less memory-hungry. Use a BTrees.OOBTree (Object keys to Object values) or BTrees.OIBTree (Object keys to Integer values), and use 3-word tuples as your keys.
Something like:
from BTrees.OOBTree import OOBTree as BTree
The interface is, more or less, dict-like, with the added bonus (for you) that .keys(), .items(), .iterkeys() and .iteritems() take two optional min, max arguments:
>>> t = BTree()
>>> t['a', 'b', 'c'] = 10
>>> t['a', 'b', 'z'] = 11
>>> t['a', 'a', 'z'] = 12
>>> t['a', 'd', 'z'] = 13
>>> print list(t.keys(('a', 'b'), ('a', 'c')))
[('a', 'b', 'c'), ('a', 'b', 'z')]
¹ Note that if you are on Windows and work with Python >2.4, finding prebuilt packages can be tricky; I know packages exist for more recent Python versions, but I can't recollect where.
PS They exist in the CheeseShop ☺
A couple of attempts:
I figure you're doing something similar to this:
from __future__ import with_statement
import time
from collections import deque, defaultdict
# Just used to generate some triples of words
def triplegen(words="/usr/share/dict/words"):
    d = deque()
    with open(words) as f:
        for i in range(3):
            d.append(f.readline().strip())
        while d[-1] != '':
            yield tuple(d)
            d.popleft()
            d.append(f.readline().strip())

if __name__ == '__main__':
    class D(dict):
        def __missing__(self, key):
            self[key] = D()
            return self[key]

    h = D()
    for a, b, c in triplegen():
        h[a][b][c] = 1
    time.sleep(60)
That gives me ~88MB.
Changing the storage to
h[a, b, c] = 1
takes ~25MB
Interning a, b, and c makes it take about 31MB. My case is a bit special because my words never repeat in the input. You might try some variations yourself and see if one of these helps you.
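For example, the interned variation could look like this (in Python 2, where intern is a builtin; Python 3 moved it to sys.intern):

h = {}
for a, b, c in triplegen():
    # intern() makes repeated words share a single string object.
    h[intern(a), intern(b), intern(c)] = 1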