更快地执行pandas groupby操作的替代方法 [英] Faster alternative to perform pandas groupby operation

查看：431 发布时间：2020/5/18 20:42:48 python pandas numpy pandas-groupby

本文介绍了更快地执行pandas groupby操作的替代方法的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个名称为(person_name)，日期和颜色(shirt_color)为列的数据集.

I have a dataset with name (person_name), day and color (shirt_color) as columns.

每个人在特定的一天都穿着某种颜色的衬衫.天数可以是任意的.

Each person wears a shirt with a certain color on a particular day. The number of days can be arbitrary.

例如输入:

name  day  color
----------------
John   1   White
John   2   White
John   3   Blue
John   4   Blue
John   5   White
Tom    2   White
Tom    3   Blue
Tom    4   Blue
Tom    5   Black
Jerry  1   Black
Jerry  2   Black
Jerry  4   Black
Jerry  5   White

我需要找到每个人最常用的颜色.

I need to find the most frequently used color by each person.

例如结果:

name    color
-------------
Jerry   Black
John    White
Tom     Blue

我正在执行以下操作来获取结果，该方法可以正常运行，但速度很慢:

I am performing the following operation to get the results, which works fine but is quite slow:

most_frquent_list = [[name, group.color.mode()[0]] 
                        for name, group in data.groupby('name')]
most_frquent_df = pd.DataFrame(most_frquent_list, columns=['name', 'color'])

现在假设我有一个包含500万个唯一名称的数据集.进行上述操作的最佳/最快方法是什么?

Now suppose I have a dataset with 5 million unique names. What is the best/fastest way to perform the above operation?

Numpy的`numpy.add.at`和`pandas.factorize`

这是为了快速.但是，我也尝试将其组织为可读性.

Numpy's `numpy.add.at` and `pandas.factorize`

This is intended to be fast. However, I tried to organize it to be readable as well.

i, r = pd.factorize(df.name)
j, c = pd.factorize(df.color)
n, m = len(r), len(c)

b = np.zeros((n, m), dtype=np.int64)

np.add.at(b, (i, j), 1)
pd.Series(c[b.argmax(1)], r)

John     White
Tom       Blue
Jerry    Black
dtype: object

`groupby`，`size`和`idxmax`

`groupby`, `size`, and `idxmax`

df.groupby(['name', 'color']).size().unstack().idxmax(1)

name
Jerry    Black
John     White
Tom       Blue
dtype: object

name
Jerry    Black
John     White
Tom       Blue
Name: color, dtype: object

`Counter`

¯\_(ツ)_/¯

`Counter`

¯\_(ツ)_/¯

from collections import Counter

df.groupby('name').color.apply(lambda c: Counter(c).most_common(1)[0][0])

name
Jerry    Black
John     White
Tom       Blue
Name: color, dtype: object

这篇关于更快地执行pandas groupby操作的替代方法的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

更快地执行pandas groupby操作的替代方法 [英] Faster alternative to perform pandas groupby operation

问题描述

推荐答案

Numpy的`numpy.add.at`和`pandas.factorize`

Numpy's `numpy.add.at` and `pandas.factorize`

`groupby`，`size`和`idxmax`

`groupby`, `size`, and `idxmax`

`Counter`

`Counter`

相关文章

Python最新文章

热门教程

热门工具

登录关闭

更快地执行pandas groupby操作的替代方法 [英] Faster alternative to perform pandas groupby operation

问题描述

推荐答案

Numpy的numpy.add.at和pandas.factorize

Numpy's numpy.add.at and pandas.factorize

groupby，size和idxmax

groupby, size, and idxmax

Counter

Counter

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

Numpy的`numpy.add.at`和`pandas.factorize`

Numpy's `numpy.add.at` and `pandas.factorize`

`groupby`，`size`和`idxmax`

`groupby`, `size`, and `idxmax`

`Counter`

`Counter`

登录关闭