Searching for a string in a large text file - profiling various methods in Python

Variant 1 is great if you need to launch many sequential searches. Since a set is internally a hash table, it's rather good at lookups. It takes time to build, though, and it only works well if your data fits into RAM.
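A minimal sketch of what this could look like, assuming Variant 1 means loading every line of the file into a set and testing membership (the file name and search string are placeholders):

with open('data.txt', 'r') as f:
    lines = set(f)  # each element keeps its trailing '\n'

def in_set(string):
    # the query must include the trailing newline to match what set(f) stored
    return string in lines

print(in_set('Some test string\n'))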

Variant 3 is good for very big files, because you have plenty of address space to map them and the OS caches enough data. You do a full scan, though, so it can become rather slow once your data no longer fits into RAM.
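A rough sketch of this variant, assuming it uses mmap to map the file and scan the raw bytes (names are placeholders; the match is anchored on '\n' so only whole lines count, and a last line without a trailing newline would be missed):

import mmap

def in_file_mmap(path, line):
    # line is a bytes object including the trailing newline
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            if mm[:len(line)] == line:           # the very first line of the file
                return True
            return mm.find(b'\n' + line) != -1   # any later line

print(in_file_mmap('data.txt', b'Some test string\n'))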

SQLite is definitely a nice idea if you need several searches in a row and you can't fit the data into RAM. Load your strings into a table, build an index, and SQLite builds a nice B-tree for you. The tree can fit into RAM even if the data doesn't (it's a bit like what @alienhard proposed), and even if it doesn't, the amount of I/O needed is dramatically lower. Of course, you need to create a disk-based SQLite database; I doubt that an in-memory SQLite database will beat Variant 1 significantly.
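A small sketch of how the SQLite route could look with the standard sqlite3 module (the database and file names are assumptions, not from the original post):

import sqlite3

con = sqlite3.connect('strings.db')   # disk-based database file
con.execute('CREATE TABLE IF NOT EXISTS lines (line TEXT)')
with open('data.txt', 'r') as f:
    con.executemany('INSERT INTO lines VALUES (?)',
                    ((line.rstrip('\n'),) for line in f))
con.execute('CREATE INDEX IF NOT EXISTS idx_line ON lines (line)')  # the b-tree
con.commit()

def in_db(string):
    cur = con.execute('SELECT 1 FROM lines WHERE line = ? LIMIT 1', (string,))
    return cur.fetchone() is not None

print(in_db('Some test string'))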


Custom hash table search with externalized strings

To get fast access time and a lower memory consumption you could do the following:

  • for each line compute a string hash and add it to a hash table, e.g., index[hash] = position (do not store the string). If there is a collision, store all file positions for that key in a list.
  • to look up a string, compute its hash and look it up in the table. If the key is found, read the string at that position from the file to verify you really have a match. If there are multiple positions, check each one until you find a match or run out of candidates.

Edit 1: replaced line_number with position (as a commenter pointed out, one obviously needs the actual file position, not the line number)

Edit 2: provided code for an implementation with a custom hash table, which shows that this approach is more memory-efficient than the other approaches mentioned:

from collections import namedtuple

# A chain node: the file position of a line plus a reference to the next
# entry in the bucket (either another Node or a bare position).
Node = namedtuple('Node', ['pos', 'next'])

def build_table(f, size):
    table = [None] * size
    while True:
        pos = f.tell()            # remember where this line starts
        line = f.readline()
        if not line:
            break
        i = hash(line) % size
        if table[i] is None:
            # first entry for this bucket: store the bare position
            table[i] = pos
        else:
            # collision: prepend a Node that links to the existing entry
            table[i] = Node(pos, table[i])
    return table

def search(string, table, f):
    i = hash(string) % len(table)
    entry = table[i]
    while entry is not None:
        pos = entry.pos if isinstance(entry, Node) else entry
        f.seek(pos)
        # re-read the line from the file to rule out a hash collision
        if f.readline() == string:
            return True
        entry = entry.next if isinstance(entry, Node) else None
    return False

SIZE = 2**24
with open('data.txt', 'r') as f:
    table = build_table(f, SIZE)
    print(search('Some test string\n', table, f))

The hash of a line is only used to index into the table (if we used a normal dict, the hashes would also be stored as keys). The file position of the line is stored at the given index. Collisions are resolved with chaining, i.e., we create a linked list. However, the first entry in a bucket is never wrapped in a Node (this optimization makes the code a bit more complicated, but it saves quite a bit of space).

For a file with 6 million lines I chose a hash table size of 2^24. With my test data I got 933132 collisions. (A hash table of half the size was comparable in memory consumption, but resulted in more collisions. Since more collisions mean more file accesses during searches, I would rather use the larger table.)
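As a side note, the collision count can be reproduced by walking the chains of the table built by the code above; the helper below is just a small sketch, not part of the measurement code (every Node in a chain was created by exactly one collision during the build):

def count_collisions(table):
    collisions = 0
    for entry in table:
        # follow the chain; the last element is a bare position, not a Node
        while isinstance(entry, Node):
            collisions += 1
            entry = entry.next
    return collisions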

Hash table: 128MB (sys.getsizeof([None]*(2**24)))
Nodes:       64MB (sys.getsizeof(Node(None, None)) * 933132)
Pos ints:   138MB (6000000 ints * 24 bytes each)
-----------------
TOTAL:      330MB (real memory usage of the Python process was ~350MB)