Best way to join two large datasets in Pandas
Question
I'm downloading two datasets from two different databases that need to be joined. Each of them is around 500MB when I store it as CSV. Separately they fit into memory, but when I load both I sometimes get a memory error, and I definitely run into trouble when I try to merge them with pandas.
What is the best way to do an outer join on them so that I don't get a memory error? I don't have any database servers at hand, but I can install any kind of open-source software on my computer if that helps. Ideally I would still like to solve it in pandas alone, but I'm not sure whether that is possible at all.
To clarify: by merging I mean an outer join. Each table has two columns: product and version. I want to check which products and versions appear only in the left table, only in the right table, and in both. I do that with:
pd.merge(df1,df2,left_on=['product','version'],right_on=['product','version'], how='outer')
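For the "left only / right only / both" check specifically, pandas' `merge` accepts an `indicator=True` argument that adds a `_merge` column labeling each row. A minimal sketch with toy data (the frame contents here are hypothetical, not from the question):

```python
import pandas as pd

# Toy stand-ins for the two downloaded tables (hypothetical values).
df1 = pd.DataFrame({'product': ['a', 'b'], 'version': [1, 1]})
df2 = pd.DataFrame({'product': ['a', 'c'], 'version': [1, 2]})

# Since the key columns share names, on= is equivalent to left_on=/right_on=.
# indicator=True adds a '_merge' column with values
# 'left_only', 'right_only', or 'both' for each row.
result = pd.merge(df1, df2, on=['product', 'version'],
                  how='outer', indicator=True)
print(result)
```

This doesn't address the memory problem by itself, but it gives the left-only/right-only/both classification directly instead of having to infer it from NaNs.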
Answer
This seems like a task that dask was designed for. Essentially, dask can do pandas operations out-of-core, so you can work with datasets that don't fit into memory. The dask.dataframe API is a subset of the pandas API, so there shouldn't be much of a learning curve. See the Dask DataFrame Overview page for some additional DataFrame-specific details.
import dask.dataframe as dd
# Read in the csv files.
df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')
# Merge the csv files.
df = dd.merge(df1, df2, how='outer', on=['product','version'])
# Write the output.
df.to_csv('file3.csv', index=False)
Assuming that 'product' and 'version' are the only columns, it may be more efficient to replace the merge with:
df = dd.concat([df1, df2]).drop_duplicates()
I'm not entirely sure if that will be better, but apparently merges that aren't done on the index are "slow-ish" in dask, so it could be worth a try.
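To see why `concat` plus `drop_duplicates` can stand in for the outer join here: when the key columns are the *only* columns, an outer merge on both columns just produces the union of the distinct (product, version) pairs, which is exactly what concatenating and deduplicating yields. A small pandas check with toy data (hypothetical values):

```python
import pandas as pd

# Toy frames with the same two columns as in the question (hypothetical data).
df1 = pd.DataFrame({'product': ['a', 'b'], 'version': [1, 1]})
df2 = pd.DataFrame({'product': ['a', 'c'], 'version': [1, 2]})

# Full outer merge on both columns...
merged = pd.merge(df1, df2, on=['product', 'version'], how='outer')

# ...versus concat + drop_duplicates.
deduped = pd.concat([df1, df2]).drop_duplicates()

# Both produce the same set of (product, version) pairs.
same = set(map(tuple, merged.values)) == set(map(tuple, deduped.values))
print(same)
```

Note that this shortcut loses the "left only / right only / both" information; if you need that classification, the `indicator=True` merge is the way to get it.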