Pandas - Slice large dataframe into chunks


Question

I have a large dataframe (more than 3 million rows) that I'm trying to pass through a function (the one below is heavily simplified), and I keep getting a MemoryError.

I think the dataframe I'm passing into the function is too large, so I'm trying to:

1) Slice the dataframe into smaller chunks (preferably sliced by AcctName)

2) Pass each chunk into the function

3) Concatenate the chunks back into one large dataframe

def trans_times_2(df):
    df['Double_Transaction'] = df['Transaction'] * 2

large_df 
AcctName   Timestamp    Transaction
ABC        12/1         12.12
ABC        12/2         20.89
ABC        12/3         51.93    
DEF        12/2         13.12
DEF        12/8          9.93
DEF        12/9         92.09
GHI        12/1         14.33
GHI        12/6         21.99
GHI        12/12        98.81

I know that my function works properly, since it runs fine on a smaller dataframe (e.g. 40,000 rows). I tried the following, but I could not concatenate the small dataframes back into one large dataframe.

def split_df(df):
    new_df = []
    AcctNames = df.AcctName.unique()
    DataFrameDict = {elem: pd.DataFrame for elem in AcctNames}
    key_list = [k for k in DataFrameDict.keys()]
    new_df = []
    for key in DataFrameDict.keys():
        DataFrameDict[key] = df[:][df.AcctNames == key]
        trans_times_2(DataFrameDict[key])
    rejoined_df = pd.concat(new_df)

How I envision the split dataframes:

df1
AcctName   Timestamp    Transaction  Double_Transaction
ABC        12/1         12.12        24.24
ABC        12/2         20.89        41.78
ABC        12/3         51.93        103.86

df2
AcctName   Timestamp    Transaction  Double_Transaction
DEF        12/2         13.12        26.24
DEF        12/8          9.93        19.86
DEF        12/9         92.09        184.18

df3
AcctName   Timestamp    Transaction  Double_Transaction
GHI        12/1         14.33        28.66
GHI        12/6         21.99        43.98
GHI        12/12        98.81        197.62

Recommended answer

You can use a list comprehension to split your dataframe into smaller dataframes contained in a list.

n = 200000  #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]

You can access the chunks with:

list_df[0]
list_df[1]
# ...and so on

Then you can assemble the chunks back into a single dataframe using pd.concat.
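To make the three steps concrete (slice, apply, concatenate), here is a minimal sketch using the sample data from the question. The helper name `process_in_chunks` is hypothetical, and `trans_times_2` is changed here so that it returns a new dataframe instead of modifying its argument; both are assumptions for illustration, not part of the original post. The chunk size of 4 only keeps the example small.

```python
import pandas as pd


def trans_times_2(df):
    # Assumed variant of the question's function: work on a copy and return it,
    # so each processed chunk can be collected.
    df = df.copy()
    df['Double_Transaction'] = df['Transaction'] * 2
    return df


def process_in_chunks(df, func, n=200000):
    # Hypothetical helper: apply func to consecutive n-row slices,
    # then rejoin the processed slices in their original order.
    processed = [func(df.iloc[i:i + n]) for i in range(0, df.shape[0], n)]
    return pd.concat(processed)


large_df = pd.DataFrame({
    'AcctName': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF', 'GHI', 'GHI', 'GHI'],
    'Timestamp': ['12/1', '12/2', '12/3', '12/2', '12/8', '12/9',
                  '12/1', '12/6', '12/12'],
    'Transaction': [12.12, 20.89, 51.93, 13.12, 9.93, 92.09,
                    14.33, 21.99, 98.81],
})

# Three chunks of at most 4 rows each, processed one at a time.
result = process_in_chunks(large_df, trans_times_2, n=4)
print(result)
```

Whether this avoids the MemoryError depends on what the real function allocates per call; chunking only helps if the function's temporary memory use scales with the number of rows it is given.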

By account name:

list_df = []

for n,g in df.groupby('AcctName'):
    list_df.append(g)
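Putting the groupby loop together with the function and pd.concat might look like the sketch below. The attempt in the question never appends anything to new_df and refers to df.AcctNames rather than df.AcctName, which is why the final concatenation fails. The function name `split_apply_concat` is hypothetical, and the explicit `.copy()` is an assumption added so the in-place column assignment acts on an independent dataframe.

```python
import pandas as pd


def trans_times_2(df):
    # As in the question: adds the new column in place and returns nothing.
    df['Double_Transaction'] = df['Transaction'] * 2


def split_apply_concat(df, key='AcctName'):
    # Hypothetical helper: one group per account, processed separately.
    pieces = []
    for _, group in df.groupby(key, sort=False):
        group = group.copy()      # independent copy, safe to modify in place
        trans_times_2(group)
        pieces.append(group)      # the step missing from the question's attempt
    return pd.concat(pieces)


large_df = pd.DataFrame({
    'AcctName': ['ABC', 'DEF', 'ABC', 'GHI', 'DEF', 'GHI'],
    'Timestamp': ['12/1', '12/2', '12/2', '12/1', '12/8', '12/6'],
    'Transaction': [12.12, 13.12, 20.89, 14.33, 9.93, 21.99],
})

rejoined_df = split_apply_concat(large_df)
print(rejoined_df)
```

Rows come back grouped by account rather than in their original order; calling `rejoined_df.sort_index()` restores the original row order if that matters.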
