Pandas: Merge array is too big, large, how to merge in parts?


Question

When I try to merge two DataFrames using pandas, I receive this message: "ValueError: array is too big." I estimate the merged table will have about 5 billion rows, which is probably too much for my computer with 8GB of RAM (is this limited just by my RAM, or is it built into pandas?).
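The size of the merged table can be estimated without performing the merge. The sketch below is a minimal illustration using the small example tables from this question; the variable names `left_counts`, `right_counts` and `estimated_rows` are made up for the example. It relies on the fact that an inner merge on a single key produces, for each key value, the number of matching rows on the left times the number on the right.

```python
import pandas as pd

# Small example tables (same shape as the ones used in this question)
table1 = pd.DataFrame({'scenario': [0, 0, 1, 1],
                       'letter': ['a', 'b'] * 2,
                       'number1': [10, 50, 20, 30]})
table2 = pd.DataFrame({'letter': ['a', 'a', 'b', 'b'],
                       'number2': [2, 5, 4, 7]})

# Number of rows per key value on each side of the merge
left_counts = table1['letter'].value_counts()
right_counts = table2['letter'].value_counts()

# An inner merge yields left_count * right_count rows for each shared key;
# keys present on only one side contribute 0 rows (hence fill_value=0)
estimated_rows = int(left_counts.mul(right_counts, fill_value=0).sum())

print(estimated_rows)
```

If the estimate is far beyond what fits in memory, the merge has to be split into parts regardless of any internal limit.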

I know that once I have the merged table I will calculate a new column and then filter the rows, keeping the maximum value within each group. The final output table will therefore have only 2.5 million rows.

How can I break this problem up so that I can run the merge on smaller parts and build up the output table without hitting my RAM limit?

The method below works correctly for this small data, but fails on the larger, real data:

import pandas as pd
import numpy as np

# Create input tables
t1 = {'scenario':[0,0,1,1],
      'letter':['a','b']*2,
      'number1':[10,50,20,30]}

t2 = {'letter':['a','a','b','b'],
      'number2':[2,5,4,7]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

# Merge the two, create the new column. This causes "...array is too big."
table3 = pd.merge(table1,table2,on='letter')
table3['calc'] = table3['number1']*table3['number2']

# Filter, bringing back the rows where 'calc' is maximum per scenario+letter
table3 = table3.loc[table3.groupby(['scenario','letter'])['calc'].idxmax()]

This is a follow-up to two previous questions:

Does iterrows have performance issues?

What is a good way to avoid using iterrows in this example?

I answer my own question below.

Recommended answer

You can break up the first table using groupby (for instance, on 'scenario'). It could make sense to first create a new variable that gives you groups of exactly the size you want. Then iterate through these groups, doing the following on each: run the merge, filter, and then append the smaller result to your final output table.

As explained in "Does iterrows have performance issues?", iterating is slow. Therefore use groups that are as large as your memory allows, so that most of the work is still done by pandas' vectorized methods and the Python-level loop runs only a few times. Pandas is relatively quick when it comes to merging.
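One way to get groups of a chosen size is to build a chunk key from the scenario values. This is a minimal sketch, not a tuned implementation: `scenarios_per_chunk` and `chunk_id` are hypothetical names introduced for the illustration, and the right value depends on how many merged rows fit in memory. Every scenario is kept entirely inside one chunk, so the per-group maximum computed within a chunk is also the overall maximum for that group.

```python
import pandas as pd

# Small example tables (same shape as the ones in the question)
table1 = pd.DataFrame({'scenario': [0, 0, 1, 1],
                       'letter': ['a', 'b'] * 2,
                       'number1': [10, 50, 20, 30]})
table2 = pd.DataFrame({'letter': ['a', 'a', 'b', 'b'],
                       'number2': [2, 5, 4, 7]})

# Hypothetical tuning knob: how many scenarios to merge in one pass.
# Larger values mean fewer loop iterations but a bigger intermediate table.
scenarios_per_chunk = 1

# Map each scenario to a consecutive integer code, then bucket the codes
codes, _ = pd.factorize(table1['scenario'])
chunk_id = codes // scenarios_per_chunk

parts = []
for _, group in table1.groupby(chunk_id):
    # Merge only this chunk's rows with the second table
    temp = pd.merge(group, table2, on='letter')
    temp['calc'] = temp['number1'] * temp['number2']
    # A chunk may hold several scenarios, so group on both columns
    best = temp.groupby(['scenario', 'letter'])['calc'].idxmax()
    parts.append(temp.loc[best])

result = pd.concat(parts, ignore_index=True)
print(result)
```

Raising `scenarios_per_chunk` trades memory for speed, which is the balance described above.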

Continuing from the point where the input tables have been created:

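# Collect the per-scenario results in a list and concatenate once at the end
# (DataFrame.append was removed in pandas 2.0, and growing a frame in a loop is slow)
parts = []

grouped = table1.groupby('scenario')

for _, group in grouped:
    # Merge only this scenario's rows, so the intermediate table stays small
    temp = pd.merge(group, table2, on='letter')
    temp['calc'] = temp['number1'] * temp['number2']
    # Keep the row with the maximum 'calc' for each letter in this scenario
    parts.append(temp.loc[temp.groupby('letter')['calc'].idxmax()])
    del temp

table3 = pd.concat(parts)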
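Before running the chunked version on the real data, it may be worth checking on the small example that it gives the same rows as the single merge from the question. The sketch below assumes the example tables above; the helper names `merge_all_at_once`, `merge_in_parts` and `normalise` are made up for the illustration.

```python
import pandas as pd

table1 = pd.DataFrame({'scenario': [0, 0, 1, 1],
                       'letter': ['a', 'b'] * 2,
                       'number1': [10, 50, 20, 30]})
table2 = pd.DataFrame({'letter': ['a', 'a', 'b', 'b'],
                       'number2': [2, 5, 4, 7]})


def merge_all_at_once(t1, t2):
    """Single merge, as in the question (only feasible for small data)."""
    merged = pd.merge(t1, t2, on='letter')
    merged['calc'] = merged['number1'] * merged['number2']
    return merged.loc[merged.groupby(['scenario', 'letter'])['calc'].idxmax()]


def merge_in_parts(t1, t2):
    """One merge per scenario, keeping only the best row per letter."""
    parts = []
    for _, group in t1.groupby('scenario'):
        temp = pd.merge(group, t2, on='letter')
        temp['calc'] = temp['number1'] * temp['number2']
        parts.append(temp.loc[temp.groupby('letter')['calc'].idxmax()])
    return pd.concat(parts)


def normalise(df):
    """Put rows and index in a fixed order so two results can be compared."""
    cols = ['scenario', 'letter', 'number1', 'number2', 'calc']
    return df[cols].sort_values(['scenario', 'letter']).reset_index(drop=True)


full = normalise(merge_all_at_once(table1, table2))
chunked = normalise(merge_in_parts(table1, table2))
same = full.equals(chunked)
print(same)
```

Note that ties in 'calc' are not handled specially here; `idxmax` keeps the first maximal row in each group.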
