Efficient way to merge multiple large DataFrames


Problem description

Suppose I have 4 small DataFrames, df1, df2, df3 and df4:

import pandas as pd
from functools import reduce
import numpy as np

df1 = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
df2 = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
df3 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])  
df4 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])   


df1.columns = ['name', 'id', 'price']
df2.columns = ['name', 'id', 'price']
df3.columns = ['name', 'id', 'price']    
df4.columns = ['name', 'id', 'price']   

df1 = df1.rename(columns={'price':'pricepart1'})
df2 = df2.rename(columns={'price':'pricepart2'})
df3 = df3.rename(columns={'price':'pricepart3'})
df4 = df4.rename(columns={'price':'pricepart4'})

The 4 DataFrames are created above; what I would like is shown in the code below.

# Merge dataframes
df = pd.merge(df1, df2, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df3, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df4, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')

# Fill na values with 'missing'
df = df.fillna('missing')
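
For reference, with the small sample data above, the merged and filled frame comes out like this (row order follows the successive outer merges):

  name  id pricepart1 pricepart2 pricepart3 pricepart4
0    a   1       10.0       15.0    missing    missing
1    a   2       20.0       20.0    missing    missing
2    b   1        4.0    missing    missing    missing
3    c   1        2.0        2.0    missing    missing
4    e   2       10.0    missing       20.0       20.0
5    d   1    missing    missing       10.0       10.0
6    f   1    missing    missing        1.0       15.0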

So I have achieved this for 4 DataFrames that don't have many rows and columns.

Basically, I want to scale out the above outer merge solution to MULTIPLE (48) DataFrames of size 62245 x 3.

So I came up with this solution, building on another StackOverflow answer that used a lambda reduce:

from functools import reduce
import pandas as pd
import numpy as np

dfList = []

# Create the 48 DataFrames of size 62245 x 3
for i in range(48):
    dfList.append(pd.DataFrame(np.random.randint(0, 100, size=(62245, 3)),
                               columns=['name', 'id', 'pricepart' + str(i + 1)]))

# The solution I came up with to extend the merge to more than 3 DataFrames
df_merged = reduce(lambda left, right: pd.merge(left, right, left_on=['name', 'id'],
                                                right_on=['name', 'id'], how='outer'),
                   dfList).fillna('missing')

This results in a MemoryError.

I do not know what to do to stop the kernel from dying. I've been stuck on this for two days. Some code for the EXACT merge operation that I have performed that does not cause the MemoryError, or something that gives the same result, would be really appreciated.

Also, the 3 columns in the main DataFrame (NOT the reproducible 48 DataFrames in the example) are of type int64, int64 and float64, and I'd prefer them to stay that way because of the integers and floats they represent.
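
Note that an outer merge introduces NaN for keys missing on one side, which silently upcasts a plain int64 column to float64. A minimal sketch of one way to keep integer columns integral through the merges, using pandas' nullable Int64 extension dtype (available since pandas 0.24), applied to the reproducible example's pricepart columns:

# Sketch: nullable Int64 can hold missing values without forcing the
# column to float64 the way plain int64 does after an outer merge.
for i, df in enumerate(dfList):
    df['pricepart' + str(i + 1)] = df['pricepart' + str(i + 1)].astype('Int64')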

Instead of iteratively running the merge operations or using the reduce lambda function, I have done it in groups of 2! Also, I've changed the datatype of some columns; some did not need to be float64, so I brought them down to float16. It gets very far but still ends up throwing a MemoryError.
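
A minimal sketch of that downcast, assuming the pricepart columns from the reproducible example are the ones being shrunk (float16 keeps only about 3 significant digits, so this trades precision for memory):

# Downcast the price columns to float16 to reduce memory before merging.
for df in dfList:
    price_cols = [c for c in df.columns if c.startswith('pricepart')]
    df[price_cols] = df[price_cols].astype('float16')

The pairwise merging itself looks like this: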

intermediatedfList = dfList

# Merge the 48 frames two at a time, until only 2 remain
while len(intermediatedfList) != 2:

    # Start a fresh auxiliary list each round (so it never aliases intermediatedfList)
    tempdfList = []

    # If there is an even number of DataFrames
    if len(intermediatedfList) % 2 == 0:

        # Go in steps of two
        for i in range(0, len(intermediatedfList), 2):

            # Merge the DataFrames at index i and i + 1
            df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1],
                           left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
            df1.info(memory_usage='deep')

            # Append the merged frame to this round's list
            tempdfList.append(df1)

    else:

        # If there is an odd number of DataFrames, keep the first DataFrame out
        tempdfList = [intermediatedfList[0]]

        # Go in steps of two starting from 1 instead of 0
        for i in range(1, len(intermediatedfList), 2):

            # Merge the DataFrames at index i and i + 1
            df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1],
                           left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
            df1.info(memory_usage='deep')
            tempdfList.append(df1)

    # Continue the while loop on this round's results
    intermediatedfList = tempdfList

Is there any way I can optimize my code to avoid the MemoryError? I've even used an AWS instance with 192GB of RAM (I now owe them $7, which I could've given one of y'all); that gets farther than what I've gotten locally, but it still throws a MemoryError after reducing a list of 28 DataFrames to 4.

Recommended answer

You may get some benefit from performing index-aligned concatenation using pd.concat. This should hopefully be faster and more memory efficient than an outer merge as well.

df_list = [df1, df2, ...]

# Give every frame the same (name, id) index so concat can align them
for df in df_list:
    df.set_index(['name', 'id'], inplace=True)

# Outer-aligns on the index by default; pass join='inner' to keep only shared keys
df = pd.concat(df_list, axis=1)
df.reset_index(inplace=True)
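
For instance, applied to the four small frames from the question (each frame's (name, id) pairs are unique, which axis=1 concatenation requires), this reproduces the outer-merge result up to row order:

df_list = [df1, df2, df3, df4]
for df in df_list:
    df.set_index(['name', 'id'], inplace=True)

# One pricepart column per frame, aligned on the union of (name, id) keys
df = pd.concat(df_list, axis=1).fillna('missing')
df.reset_index(inplace=True)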

Alternatively, you can replace the concat (the second step) with an iterative join:

from functools import reduce

# DataFrame.join defaults to how='left'; how='outer' matches the merges above
df = reduce(lambda x, y: x.join(y, how='outer'), df_list)

This may or may not be better than the merge.
