Best way to join two large datasets in Pandas


Question

I'm downloading two datasets from two different databases that need to be joined. Each of them separately is around 500MB when I store them as CSV. Separately they fit into memory, but when I load both I sometimes get a memory error. I definitely get into trouble when I try to merge them with pandas.

What is the best way to do an outer join on them so that I don't get a memory error? I don't have any database servers at hand, but I can install any kind of open-source software on my computer if that helps. Ideally I would still like to solve it in pandas only, but I'm not sure if that is possible at all.

To clarify: by merging I mean an outer join. Each table has two columns, product and version. I want to check which products and versions are in the left table only, in the right table only, and in both tables. I do that with:

pd.merge(df1,df2,left_on=['product','version'],right_on=['product','version'], how='outer')
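(A side note, not part of the original question: pandas' merge also accepts indicator=True, which adds a _merge column labelling each row as left_only, right_only, or both. A minimal sketch, assuming df1 and df2 are the two tables already loaded:)

import pandas as pd

# df1 and df2 are assumed to hold the two tables with 'product' and 'version' columns.
merged = pd.merge(df1, df2, on=['product', 'version'], how='outer', indicator=True)

# The '_merge' column tells which side(s) each (product, version) pair came from.
print(merged['_merge'].value_counts())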

Answer

This seems like a task that dask was designed for. Essentially, dask can do pandas operations out-of-core, so you can work with datasets that don't fit into memory. The dask.dataframe API is a subset of the pandas API, so there shouldn't be much of a learning curve. See the Dask DataFrame Overview page for some additional DataFrame-specific details.

import dask.dataframe as dd

# Read in the csv files.
df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')

# Merge the csv files.
df = dd.merge(df1, df2, how='outer', on=['product','version'])

# Write the output.
df.to_csv('file3.csv', index=False)

Assuming that 'product' and 'version' are the only columns, it may be more efficient to replace the merge with:

df = dd.concat([df1, df2]).drop_duplicates()

I'm not entirely sure whether that will be better, but apparently merges that aren't done on the index are "slow-ish" in dask, so it could be worth a try.
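(A hedged follow-up, my own sketch rather than part of the original answer: one way to try an index-based merge is to build a single join key from the two columns and set it as the index before merging. The file names and column names are the same assumptions as in the code above.)

import dask.dataframe as dd

def with_key(df):
    # Combine the two join columns into a single string key and make it the index.
    # Note: this assumes neither column contains the '|' separator character.
    key = df['product'].astype(str) + '|' + df['version'].astype(str)
    return df.assign(key=key).set_index('key')

df1 = with_key(dd.read_csv('file1.csv'))
df2 = with_key(dd.read_csv('file2.csv'))

# Outer join on the shared index; overlapping column names get suffixes.
df = dd.merge(df1, df2, how='outer', left_index=True, right_index=True,
              suffixes=('_left', '_right'))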

