Best way to join two large datasets in Pandas

Problem description

I'm downloading two datasets from two different databases that need to be joined. Each of them is around 500MB when stored as a CSV. Separately they fit into memory, but when I load both I sometimes get a memory error. I definitely get into trouble when I try to merge them with pandas.

What is the best way to do an outer join on them so that I don't get a memory error? I don't have any database servers at hand, but I can install any kind of open-source software on my computer if that helps. Ideally I would still like to solve it in pandas only, but I'm not sure if that is possible at all.

To clarify: by merging I mean an outer join. Each table has two columns: product and version. I want to check which products and versions are in the left table only, in the right table only, and in both tables. I do that with:

pd.merge(df1,df2,left_on=['product','version'],right_on=['product','version'], how='outer')
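As an aside on that breakdown: when the frames do fit in memory, the left-only / right-only / both classification can be read directly off pandas' merge indicator. A minimal sketch with made-up data (the frames and values below are purely illustrative):

import pandas as pd

# Tiny stand-ins for the two downloaded tables.
df1 = pd.DataFrame({'product': ['a', 'b'], 'version': [1, 1]})
df2 = pd.DataFrame({'product': ['b', 'c'], 'version': [1, 2]})

# indicator=True adds a '_merge' column flagging each row as
# 'left_only', 'right_only' or 'both'.
merged = pd.merge(df1, df2, on=['product', 'version'],
                  how='outer', indicator=True)
print(merged['_merge'].value_counts())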

Recommended answer

This seems like a task that dask was designed for. Essentially, dask can do pandas operations out-of-core, so you can work with datasets that don't fit into memory. The dask.dataframe API is a subset of the pandas API, so there shouldn't be much of a learning curve. See the Dask DataFrame Overview page for some additional DataFrame-specific details.

import dask.dataframe as dd

# Read in the csv files.
df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')

# Merge the csv files.
df = dd.merge(df1, df2, how='outer', on=['product','version'])

# Write the output.
df.to_csv('file3.csv', index=False)
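If memory is still tight during the merge, one knob worth knowing about is the blocksize argument of dd.read_csv, which controls how much data goes into each partition: smaller blocks mean a lighter per-task memory footprint at the cost of more partitions. A sketch of that (the 64MB figure is an arbitrary assumption to tune for your machine, not a recommendation):

import dask.dataframe as dd

# Smaller partitions keep the memory needed per task down.
df1 = dd.read_csv('file1.csv', blocksize='64MB')
df2 = dd.read_csv('file2.csv', blocksize='64MB')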

Assuming that 'product' and 'version' are the only columns, it may be more efficient to replace the merge with:

df = dd.concat([df1, df2]).drop_duplicates()

I'm not entirely sure if that will be better, but apparently merges that aren't done on the index are "slow-ish" in dask, so it could be worth a try.
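For completeness, here is roughly what an index-based variant could look like: build a single composite join key, set it as the index (which sorts and repartitions the data once), and then merge on the index so the join can align partitions instead of reshuffling. This is only a sketch under the assumption that product and version can be concatenated into a string key; the 'key' column name and the '|' separator are made up for illustration:

import dask.dataframe as dd

df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')

# Build one composite key per row from the two join columns.
df1 = df1.assign(key=df1['product'].astype(str) + '|' + df1['version'].astype(str))
df2 = df2.assign(key=df2['product'].astype(str) + '|' + df2['version'].astype(str))

# set_index shuffles and sorts the data once; afterwards the merge
# can align on the index rather than shuffling again.
df1 = df1.set_index('key')
df2 = df2.set_index('key')

df = dd.merge(df1, df2, how='outer', left_index=True, right_index=True,
              suffixes=('_left', '_right'))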
