Converting Django QuerySet to pandas DataFrame

import pandas as pd
import datetime
from myapp.models import BlogPost

df = pd.DataFrame(list(BlogPost.objects.all().values()))
df = pd.DataFrame(list(BlogPost.objects.filter(date__gte=datetime.datetime(2012, 5, 1)).values()))

# limit which fields
df = pd.DataFrame(list(BlogPost.objects.all().values('author', 'date', 'slug')))

The above is how I do the same thing. The most useful addition is specifying which fields you are interested in. If it's only a subset of the available fields you are interested in, then this would give a performance boost I imagine.


Django Pandas solves this rather neatly: https://github.com/chrisdev/django-pandas/

From the README:

class MyModel(models.Model):
    full_name = models.CharField(max_length=25)
    age = models.IntegerField()
    department = models.CharField(max_length=3)
    wage = models.FloatField()

from django_pandas.io import read_frame
qs = MyModel.objects.all()
df = read_frame(qs)

Convert the queryset on values_list() will be more memory efficient than on values() directly. Since the method values() returns a queryset of list of dict (key:value pairs), values_list() only returns list of tuple (pure data). It will save about 50% memory, just need to set the column information when you call pd.DataFrame().

Method 1:
    queryset = models.xxx.objects.values("A","B","C","D")
    df = pd.DataFrame(list(queryset))  ## consumes much memory
    #df = pd.DataFrame.from_records(queryset) ## works but no much change on memory usage

Method 2:
    queryset = models.xxx.objects.values_list("A","B","C","D")
    df = pd.DataFrame(list(queryset), columns=["A","B","C","D"]) ## this will save 50% memory
    #df = pd.DataFrame.from_records(queryset, columns=["A","B","C","D"]) ##It does not work. Crashed with datatype is queryset not list.

I tested this on my project with >1 million rows data, the peak memory is reduced from 2G to 1G.


From the Django perspective (I'm not familiar with pandas) this is fine. My only concern is that if you have a very large number of records, you may run into memory problems. If this were the case, something along the lines of this memory efficient queryset iterator would be necessary. (The snippet as written might require some rewriting to allow for your smart use of .values()).