MemoryError with python/pandas and large left outer joins

Problem description

I'm fairly new to both Python and Pandas, and I'm trying to figure out the fastest way to execute a mammoth left outer join between a left dataset with roughly 11 million rows and a right dataset with ~160K rows and four columns. It should be a many-to-one situation, but I'd like the join not to throw an error if there's a duplicate row on the right side. I'm using Canopy Express on a Windows 7 64-bit system with 8 GB of RAM, and I'm pretty much stuck with that.

Here's a model of the code I've put together so far:

import pandas as pd

# Column names are supplied explicitly (the files are read as having no header row).
leftcols = ['a','b','c','d','e','key']
leftdata = pd.read_csv("LEFT.csv", names=leftcols)

rightcols = ['x','y','z','key']
rightdata = pd.read_csv("RIGHT.csv", names=rightcols)

# Left outer join on the shared key column, then write the result out.
mergedata = pd.merge(leftdata, rightdata, on='key', how='left')
mergedata.to_csv("FINAL.csv")

This works with small files but produces a MemoryError on my system with file sizes two orders of magnitude smaller than the size of the files I actually need to merge.

I've been browsing through related questions (one, two, three), but none of the answers really get at this basic problem, or if they do, it's not explained well enough for me to recognize the potential solution. And the accepted answers are no help. I'm already on a 64-bit system and using the most current stable version of Canopy (1.5.5 64-bit, using Python 2.7.10).

What is the fastest and/or most pythonic approach to avoiding this MemoryError issue?
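
For reference, one common first step when a plain pd.merge runs out of memory is to shrink the frames before joining by giving read_csv explicit, narrow dtypes and checking the actual in-memory footprint. This is only a sketch, with the dtypes purely assumed for illustration, not something taken from the question:

import pandas as pd

leftcols = ['a', 'b', 'c', 'd', 'e', 'key']
# Assumed dtypes -- adjust to the real data. float32 halves the size of
# float64 columns, and reading the key as str avoids mixed-type inference.
leftdata = pd.read_csv(
    "LEFT.csv",
    names=leftcols,
    dtype={'a': 'float32', 'b': 'float32', 'c': 'float32',
           'd': 'float32', 'e': 'float32', 'key': str},
)

rightcols = ['x', 'y', 'z', 'key']
rightdata = pd.read_csv("RIGHT.csv", names=rightcols, dtype={'key': str})

# Report how much memory each frame really uses before attempting the merge.
print(leftdata.memory_usage(deep=True).sum())
print(rightdata.memory_usage(deep=True).sum())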

Recommended answer

This approach ended up working. Here's a model of my code:

import csv

# Big left file and small right (lookup) file; "rU" = universal newlines (Python 2).
idata = open("KEY_ABC.csv", "rU")
odata = open("KEY_XYZ.csv", "rU")

leftdata = csv.reader(idata)
rightdata = csv.reader(odata)

def gen_chunks(reader, chunksize=1000000):
    """Yield successive lists of up to chunksize rows from a csv reader.

    The same list object is reused between yields, so each chunk must be
    consumed before the next iteration (as the for loop below does).
    """
    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            del chunk[:]
        chunk.append(line)
    yield chunk

count = 0

# One lookup dictionary per non-key column of the right file, keyed on the
# join key (column index 3). The file is rewound between passes so the csv
# reader starts from the top each time.
d1 = dict([(rows[3], rows[0]) for rows in rightdata])
odata.seek(0)
d2 = dict([(rows[3], rows[1]) for rows in rightdata])
odata.seek(0)
d3 = dict([(rows[3], rows[2]) for rows in rightdata])

for chunk in gen_chunks(leftdata):
    # Append the three right-hand columns to every left row, matching on the
    # key in column index 6 and filling misses with "NaN" (i.e. a left join).
    res = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6],
                d1.get(k[6], "NaN")] for k in chunk]
    res1 = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6], k[7],
                d2.get(k[6], "NaN")] for k in res]
    res2 = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6], k[7], k[8],
                d3.get(k[6], "NaN")] for k in res1]
    # Write each joined chunk to its own numbered output file
    # ("wb" because this is Python 2's csv module).
    namestart = "FINAL_"
    nameend = ".csv"
    count = count + 1
    filename = namestart + str(count) + nameend
    with open(filename, "wb") as csvfile:
        output = csv.writer(csvfile)
        output.writerows(res2)

By splitting the left dataset into chunks, turning the right dataset into one dictionary per non-key column, and adding columns to the left dataset (filling them using the dictionaries and the key match), the script managed to do the whole left join in about four minutes with no memory issues.
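
For comparison, roughly the same chunk-and-merge strategy can be expressed in pandas itself by reading the left file with a chunksize and merging each piece against the (small) right table. This is only a sketch assuming the column layout from the question, not the code actually used above:

import pandas as pd

leftcols = ['a', 'b', 'c', 'd', 'e', 'key']
rightcols = ['x', 'y', 'z', 'key']

# The right table (~160K rows) is small enough to keep fully in memory.
rightdata = pd.read_csv("RIGHT.csv", names=rightcols)

# Stream the large left table in 1M-row pieces instead of loading it whole.
reader = pd.read_csv("LEFT.csv", names=leftcols, chunksize=1000000)

for i, chunk in enumerate(reader):
    merged = chunk.merge(rightdata, on='key', how='left')
    # Write the header once, then append the remaining chunks.
    merged.to_csv("FINAL.csv", mode='a', header=(i == 0), index=False)

Only one left-hand chunk plus the right table is in memory at any time, which is the same idea the csv-based version above implements by hand.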

Thanks also to user miku who provided the chunk generator code in a comment on this post.

That said, I highly doubt this is the most efficient way of doing this. If anyone has suggestions to improve this approach, fire away.
