更快地执行pandas groupby操作的替代方法 [英] Faster alternative to perform pandas groupby operation

查看:431
本文介绍了更快地执行pandas groupby操作的替代方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个名称为(person_name),日期和颜色(shirt_color)为列的数据集.

I have a dataset with name (person_name), day and color (shirt_color) as columns.

每个人在特定的一天都穿着某种颜色的衬衫.天数可以是任意的.

Each person wears a shirt with a certain color on a particular day. The number of days can be arbitrary.

例如输入:

name  day  color
----------------
John   1   White
John   2   White
John   3   Blue
John   4   Blue
John   5   White
Tom    2   White
Tom    3   Blue
Tom    4   Blue
Tom    5   Black
Jerry  1   Black
Jerry  2   Black
Jerry  4   Black
Jerry  5   White

我需要找到每个人最常用的颜色.

I need to find the most frequently used color by each person.

例如结果:

name    color
-------------
Jerry   Black
John    White
Tom     Blue

我正在执行以下操作来获取结果,该方法可以正常运行,但速度很慢:

I am performing the following operation to get the results, which works fine but is quite slow:

most_frquent_list = [[name, group.color.mode()[0]] 
                        for name, group in data.groupby('name')]
most_frquent_df = pd.DataFrame(most_frquent_list, columns=['name', 'color'])

现在假设我有一个包含500万个唯一名称的数据集.进行上述操作的最佳/最快方法是什么?

Now suppose I have a dataset with 5 million unique names. What is the best/fastest way to perform the above operation?

推荐答案

Numpy的numpy.add.atpandas.factorize

这是为了快速.但是,我也尝试将其组织为可读性.

Numpy's numpy.add.at and pandas.factorize

This is intended to be fast. However, I tried to organize it to be readable as well.

i, r = pd.factorize(df.name)
j, c = pd.factorize(df.color)
n, m = len(r), len(c)

b = np.zeros((n, m), dtype=np.int64)

np.add.at(b, (i, j), 1)
pd.Series(c[b.argmax(1)], r)

John     White
Tom       Blue
Jerry    Black
dtype: object


groupbysizeidxmax


groupby, size, and idxmax

df.groupby(['name', 'color']).size().unstack().idxmax(1)

name
Jerry    Black
John     White
Tom       Blue
dtype: object

name
Jerry    Black
John     White
Tom       Blue
Name: color, dtype: object


Counter

¯\_(ツ)_/¯


Counter

¯\_(ツ)_/¯

from collections import Counter

df.groupby('name').color.apply(lambda c: Counter(c).most_common(1)[0][0])

name
Jerry    Black
John     White
Tom       Blue
Name: color, dtype: object

这篇关于更快地执行pandas groupby操作的替代方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆