Is pandas.DataFrame.groupby Guaranteed To Be Stable?


Question

I've noticed several uses of pd.DataFrame.groupby followed by an apply that implicitly assume groupby is stable: that is, if a and b are instances of the same group and a appeared before b prior to grouping, then a will also appear before b after the grouping.

I think several answers clearly rely on this implicitly but, to be concrete, here is one using groupby+cumsum.
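A minimal sketch of that groupby + cumsum pattern (the column names and data are made up for illustration, not taken from the linked answer). The running total is only meaningful if rows within each group are accumulated in their original order:

```python
import pandas as pd

# Illustrative data: two groups whose rows are interleaved.
df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "val": [1, 10, 2, 20, 3],
})

# Running total per group; relies on rows of each group being
# visited in the order they appear in df.
df["running"] = df.groupby("key")["val"].cumsum()
print(df)
```

If within-group order were not preserved, the values in "running" could come out in a different sequence for the same input.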

Is there anything actually promising this behavior? The documentation only states:

Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.

Also, since pandas has indices, the functionality could theoretically be achieved without this guarantee (albeit in a more cumbersome way).

Solution

Although the docs don't state this, internally it uses a stable sort when generating the groups.
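One way to observe this behaviour is to iterate over the groups and check that each group's rows come out in their original relative order. This is a small sketch on made-up data, not a proof. Note also that the description of the `sort` parameter in more recent pandas documentation does state that groupby preserves the order of rows within each group.

```python
import pandas as pd

# "pos" records each row's original position so order is easy to inspect.
df = pd.DataFrame(
    {"key": ["b", "a", "b", "a", "b"], "pos": [0, 1, 2, 3, 4]}
)

# Collect the original positions seen in each group.
within_group_order = {
    key: group["pos"].tolist() for key, group in df.groupby("key")
}
print(within_group_order)
```

The group keys themselves are sorted by default (`sort=True`), but that affects only the order of the groups, not the order of rows inside them.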

See:

As I mentioned in the comments, this is important if you consider transform, which returns a Series with its index aligned to the original df. If the sorting didn't preserve the order, alignment would have to do additional work, since the Series would need to be sorted before being assigned. In fact, this is mentioned in the comments in the source:

_algos.groupsort_indexer implements counting sort and it is at least O(ngroups), where

ngroups = prod(shape)

shape = map(len, keys)

That is, linear in the number of combinations (cartesian product) of unique values of groupby keys. This can be huge when doing multi-key groupby. np.argsort(kind='mergesort') is O(count x log(count)) where count is the length of the data-frame; Both algorithms are stable sort and that is necessary for correctness of groupby operations.
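The two strategies named in that comment can be compared on a toy label array. In the sketch below, `counting_sort_indexer` is a hypothetical pure-Python stand-in written for illustration; it is not pandas' `_algos.groupsort_indexer`, and the label data is made up. `np.argsort(kind="mergesort")` is the real NumPy call.

```python
import numpy as np

def counting_sort_indexer(labels, ngroups):
    """Hypothetical stable counting sort; returns row positions grouped by label."""
    counts = np.bincount(labels, minlength=ngroups)
    # starts[g] is the first output slot reserved for group g.
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
    next_slot = starts.copy()
    indexer = np.empty(len(labels), dtype=np.intp)
    for i, g in enumerate(labels):
        # Rows are placed in input order, which is what makes this stable.
        indexer[next_slot[g]] = i
        next_slot[g] += 1
    return indexer

labels = np.array([2, 0, 1, 0, 2, 1, 0])
by_counting = counting_sort_indexer(labels, ngroups=3)
by_mergesort = np.argsort(labels, kind="mergesort")
print(by_counting, by_mergesort)

# ngroups for a multi-key groupby: product of the number of unique values per key.
levels = [["x", "y", "z"], [1, 2, 3, 4]]
shape = list(map(len, levels))
ngroups = int(np.prod(shape))
```

Both indexers list the rows of each group in their original relative order. The counting sort needs work proportional to `ngroups`, which here is already 12 for two small keys, while the merge sort depends only on the number of rows.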

For example, consider df.groupby(key)[col].transform('first'): the result is only well defined if the rows of each group keep their original order.
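A small sketch of that case, using made-up data and a deliberately non-sorted index so the alignment is visible:

```python
import pandas as pd

df = pd.DataFrame(
    {"key": ["b", "a", "b", "a"], "col": [10, 20, 30, 40]},
    index=[103, 101, 102, 100],
)

# Each row receives the first "col" value of its group, where "first"
# means first in the original row order of df.
first_in_group = df.groupby("key")["col"].transform("first")
print(first_in_group)
```

The result carries the same index as df, so it can be assigned straight back as a new column without any re-sorting.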
