Pandas: df.groupby() is too slow for big data set. Any alternative methods?


Problem description

df = df.groupby(df.index).sum()


I have a dataframe with 3.8 million rows (a single column), and I'm trying to group them by index. But it takes forever to finish the computation. Are there any alternative ways to deal with a very large data set? Thanks in advance!


I'm writing in Python.


The data look as shown below. The index is the customer ID, and I want to group qty_liter by the index.

df = df.groupby(df.index).sum()


But this line of code is taking too much time.


The info about this df is below:

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3842595 entries, -2147153165 to \N
Data columns (total 1 columns):
qty_liter    object
dtypes: object(1)
memory usage: 58.6+ MB

Recommended answer


The problem is that your data are not numeric. Processing strings takes a lot longer than processing numbers. Try this first:

df.index = df.index.astype(int)
df.qty_liter = df.qty_liter.astype(float)


Then do groupby() again. It should be much faster. If it is, see if you can modify your data loading step to have the proper dtypes from the beginning.
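One caveat: the df.info() output shows the index runs "to \N", a common NULL marker in database dumps, so a plain astype(int) would likely raise on that entry. A minimal sketch of a more forgiving conversion using pd.to_numeric(errors="coerce"), on made-up sample data mimicking the question's shape (the values and IDs here are illustrative, not from the original post):

```python
import pandas as pd

# Hypothetical sample: customer IDs as the index, qty_liter stored as
# strings, with a literal "\N" entry that would break astype(int)/astype(float).
df = pd.DataFrame(
    {"qty_liter": ["1.5", "2.0", "3.5", "\\N"]},
    index=["1001", "1001", "1002", "\\N"],
)

# Coerce instead of cast: unparseable entries become NaN rather than raising.
df.index = pd.to_numeric(df.index, errors="coerce")
df["qty_liter"] = pd.to_numeric(df["qty_liter"], errors="coerce")

# Drop rows whose ID could not be parsed, then group on the numeric index.
df = df[df.index.notna()]
result = df.groupby(df.index)["qty_liter"].sum()
print(result.to_dict())  # {1001.0: 3.5, 1002.0: 3.5}
```

Grouping on numeric keys avoids the per-row string hashing and comparison that made the original groupby() slow, which is the same point the answer makes with astype().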

