Pandas read_stata() with large .dta files

Solution 1:

Here is a little function that has been handy for me, using some pandas features that might not have been available when the question was originally posed:

def load_large_dta(fname):
    import sys
    import pandas as pd

    reader = pd.read_stata(fname, iterator=True)
    chunks = []

    try:
        chunk = reader.get_chunk(100 * 1000)
        while len(chunk) > 0:
            chunks.append(chunk)  # collect chunks; concatenate once at the end
            chunk = reader.get_chunk(100 * 1000)
            print('.', end='')
            sys.stdout.flush()
    except (StopIteration, KeyboardInterrupt):
        pass

    # build the DataFrame from whatever was read before EOF or Ctrl-C
    df = pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()
    print('\nloaded {} rows'.format(len(df)))

    return df
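As a quick usage sketch (the filename here is just a placeholder for your own .dta file):

df = load_large_dta('my_survey.dta')
print(df.shape)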

I loaded an 11 GB Stata file in 100 minutes with this, and it's nice to have something to play with if I get tired of waiting and hit Ctrl-C.

This notebook shows it in action.

Solution 2:

For all the people who end up on this page, please upgrade Pandas to the latest version. I had this exact problem with a computer that stalled during the load (a 300 MB Stata file but only 8 GB of system RAM), and upgrading from v0.14 to v0.16.2 solved the issue in a snap.

At the time of writing, that is v0.16.2. There have been significant speed improvements, though I don't know the specifics. See: most efficient I/O setup between Stata and Python (Pandas)
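If you are not sure which version you have installed, a quick check like this will tell you (the upgrade command below assumes a pip-based install; use your package manager's equivalent otherwise):

import pandas as pd
print(pd.__version__)

# then, from the shell, something like:
# pip install --upgrade pandas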

Solution 3:

There is a simpler way to do this using pandas' built-in read_stata function.

Assume your large file is named large.dta.

import pandas as pd

# read the file in chunks of 100,000 rows
reader = pd.read_stata("large.dta", chunksize=100000)

chunks = []
for chunk in reader:
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
df.to_csv("large.csv")
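If the only goal is converting the .dta file to CSV, a variation on the same idea is to write each chunk out as it is read, so the full DataFrame never has to sit in memory at once. This is only a rough sketch using the same file names as above:

import pandas as pd

reader = pd.read_stata("large.dta", chunksize=100000)

for i, chunk in enumerate(reader):
    # overwrite and write the header on the first chunk, then append without a header
    chunk.to_csv("large.csv", mode="w" if i == 0 else "a", header=(i == 0), index=False)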