Best way to join two large datasets in Pandas
I'm downloading two datasets from two different databases that need to be joined. Each of them is around 500MB when I store it as CSV. Separately they fit into memory, but when I load both I sometimes get a memory error, and I definitely get into trouble when I try to merge them with pandas.
What is the best way to do an outer join on them so that I don't get a memory error? I don't have any database servers at hand, but I can install any kind of open source software on my computer if that helps. Ideally I would still like to solve it in pandas alone, but I'm not sure if that is possible at all.
To clarify: by merging I mean an outer join. Each table has two columns: product and version. I want to check which products and versions are in the left table only, in the right table only, and in both tables. I do that with:
pd.merge(df1, df2, on=['product', 'version'], how='outer')
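For reference, pandas' merge has an indicator flag that labels each row with where it came from, which is exactly the left-only / right-only / both check described above. A minimal sketch, assuming df1 and df2 are already loaded:
import pandas as pd

# indicator=True adds a categorical _merge column with the values
# 'left_only', 'right_only', and 'both'.
merged = pd.merge(df1, df2, on=['product', 'version'],
                  how='outer', indicator=True)

left_only = merged[merged['_merge'] == 'left_only']
right_only = merged[merged['_merge'] == 'right_only']
in_both = merged[merged['_merge'] == 'both']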
Solution 1:
This seems like a task that dask was designed for. Essentially, dask can do pandas operations out-of-core, so you can work with datasets that don't fit into memory. The dask.dataframe API is a subset of the pandas API, so there shouldn't be much of a learning curve. See the Dask DataFrame Overview page for some additional DataFrame-specific details.
import dask.dataframe as dd

# Read in the CSV files lazily; nothing is loaded into memory yet.
df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')

# Merge the two dataframes out-of-core.
df = dd.merge(df1, df2, how='outer', on=['product', 'version'])

# Write the output. Dask writes one CSV file per partition, with the
# '*' in the name replaced by the partition number (newer versions
# can also produce a single file with single_file=True).
df.to_csv('file3-*.csv', index=False)
Assuming that 'product' and 'version' are the only columns, it may be more efficient to replace the merge with:
df = dd.concat([df1, df2]).drop_duplicates()
I'm not entirely sure if that will be better, but apparently merges that aren't done on the index are "slow-ish" in dask, so it could be worth a try.
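If the merge itself is the bottleneck, one possible workaround is to build a single synthetic key, set it as the index, and join on the index. This is only a sketch, assuming product and version can be cast to strings and that '|' never occurs in the values; note that set_index itself requires a shuffle, so this is not guaranteed to be faster for a one-off merge:
import dask.dataframe as dd

df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')

def with_key(df):
    # Combine the two join columns into one synthetic key column.
    return df.assign(key=df['product'].astype(str) + '|' + df['version'].astype(str))

df1 = with_key(df1).set_index('key')
df2 = with_key(df2).set_index('key')

# Joining on the index uses the faster index-aligned code path.
df = df1.merge(df2, left_index=True, right_index=True, how='outer',
               suffixes=('_1', '_2'))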
Solution 2:
I would recommend using an RDBMS like MySQL for that...
So you would need to load your CSV files into tables first.
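One memory-friendly way to do the load from Python is to stream each CSV into its table in chunks. This is a sketch only; the connection string, table names, and chunk size are assumptions:
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical credentials and database name; adjust to your setup.
engine = create_engine('mysql+pymysql://user:password@localhost/mydb')

# Stream each file in 100,000-row chunks so neither CSV has to fit
# into memory at once.
for csv_file, table in [('file1.csv', 'table_a'), ('file2.csv', 'table_b')]:
    for chunk in pd.read_csv(csv_file, chunksize=100_000):
        chunk.to_sql(table, engine, if_exists='append', index=False)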
After that you can perform your checks:
which products and versions are in the left table only
SELECT a.product, a.version
FROM table_a a
LEFT JOIN table_b b
ON a.product = b.product AND a.version = b.version
WHERE b.product IS NULL;
which products and versions are in the right table only
SELECT b.product, b.version
FROM table_a a
RIGHT JOIN table_b b
ON a.product = b.product AND a.version = b.version
WHERE a.product IS NULL;
in both
SELECT a.product, a.version
FROM table_a a
JOIN table_b b
ON a.product = b.product AND a.version = b.version;
Configure your MySQL server so that it uses at least 2GB of RAM.
You may also want to use the MyISAM engine for your tables; in that case check this.
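As a rough sketch of the relevant my.cnf settings (the values are assumptions and depend on how much RAM you can spare):
[mysqld]
# For InnoDB tables: the buffer pool is the main cache.
innodb_buffer_pool_size = 2G
# For MyISAM tables: the key buffer caches index blocks.
key_buffer_size = 1G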
It may be slower than pandas, but you definitely won't run into memory issues.
Other possible solutions:
- increase your RAM
- use Apache Spark SQL (distributed DataFrames) on multiple cluster nodes, though it will likely be much cheaper simply to increase your RAM (see the sketch below)
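For completeness, a minimal PySpark sketch of the same outer join (file names are placeholders; Spark can also run locally on a single machine and spill to disk when the data doesn't fit in memory):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('outer-join').getOrCreate()

# Spark processes the data in partitions, spilling to disk as needed.
df1 = spark.read.csv('file1.csv', header=True)
df2 = spark.read.csv('file2.csv', header=True)

joined = df1.join(df2, on=['product', 'version'], how='outer')
joined.write.csv('joined_output', header=True)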