Avoiding Memory Issues For GroupBy on Large Pandas DataFrame

Question

Update:

The pandas df was created like this:

df = pd.read_sql(query, engine)
encoded = pd.get_dummies(df, columns=['account'])

Creating a dask df from this df looks like this:

df = dd.from_pandas(encoded, 50)

Performing the operation with dask results in no visible progress being made (checking with dask diagnostics):

result = df.groupby('journal_entry').max().reset_index().compute()
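
For reference, the diagnostics check looks roughly like this (a sketch using dask's local ProgressBar; the bar simply never advances):

from dask.diagnostics import ProgressBar

# Wrap the computation so the local scheduler reports progress to the console
with ProgressBar():
    result = df.groupby('journal_entry').max().reset_index().compute()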

Original:

I have a large pandas df with 2.7M rows and 4,000 columns. All but four of the columns are of dtype uint8. The uint8 columns only hold values of 1 or 0. I am attempting to perform this operation on the df:

result = df.groupby('id').max().reset_index()

Predictably, this operation immediately returns a memory error. My initial thought is to chunk the df both horizontally and vertically. However, this creates a messy situation, since the .max() needs to be performed across all the uint8 columns, not just a pair of columns. In addition, it is still extremely slow to chunk the df like this. I have 32 GB of RAM on my machine.
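
For concreteness, the row-only part of that chunking looks roughly like this (a sketch with an illustrative chunk size; it works because group-wise max is associative, but as noted it is very slow):

import pandas as pd

# Take per-chunk group maxima, then reduce the partial results once more;
# the max of per-chunk maxima equals the max over the full frame
chunk_size = 100_000
partials = []
for start in range(0, len(df), chunk_size):
    part = df.iloc[start:start + chunk_size]
    partials.append(part.groupby('id').max())

result = pd.concat(partials).groupby(level=0).max().reset_index()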

What strategy could mitigate the memory issue?

Answer

You could use dask.dataframe for this task:

import dask.dataframe as dd

# from_pandas requires a partition count (or chunksize); 50 matches the update above
df = dd.from_pandas(df, npartitions=50)
result = df.groupby('id').max().reset_index().compute()

All you need to do is convert your pandas.DataFrame into a dask.dataframe. Dask is an out-of-core parallelization framework for Python that offers various parallelized container types, one of which is the dataframe. It lets you perform most common pandas.DataFrame operations in parallel and/or distributed over data that is too large to fit in memory. The core of dask is a set of schedulers and an API for building computation graphs, hence we have to call .compute() at the end for any computation to actually take place. The library is easy to install because it is written mostly in pure Python.
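
If the number of unique ids is itself very large, one more knob may help: dask's groupby aggregations accept a split_out parameter that keeps the aggregated result spread over several partitions instead of collapsing it into one (a sketch; the partition counts are illustrative):

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=50)
# split_out=8 leaves the grouped result in 8 partitions, so the reduction
# step does not concentrate every unique id in a single partition's memory
result = ddf.groupby('id').max(split_out=8).reset_index().compute()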
