Vectorized groupby with NumPy


Question

Pandas has a widely used groupby facility that splits a DataFrame according to a corresponding mapping, so that you can apply a calculation to each subgroup and recombine the results.

Can this be done flexibly in NumPy without a native Python for-loop? With a Python loop, it would look like this:

>>> import numpy as np

>>> X = np.arange(10).reshape(5, 2)
>>> groups = np.array([0, 0, 0, 1, 1])

# Split up the rows of `X` by their element-wise group label
>>> np.array([X[groups==i].sum() for i in np.unique(groups)])
array([15, 30])

Above, 15 is the sum of all elements in the first three rows of X, and 30 is the sum of all elements in the remaining two rows.

By "flexibly," I mean that we aren't focusing on one particular computation such as sum, count, or maximum, but rather passing any computation to the grouped arrays.

If not, is there a faster approach than the one above?

Recommended answer

If you want a more flexible implementation of groupby that can group using any of NumPy's ufuncs:

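import numpy as np

def groupby_np(X, groups, axis=0, uf=np.add, out=None, minlength=0, identity=None):
    # One output slot per group label 0..groups.max(), or more if `minlength` asks for it.
    minlength = max(minlength, groups.max() + 1)
    if identity is None:
        identity = uf.identity
    # Every axis except the grouped one is reduced away.
    other_axes = list(range(X.ndim))
    del other_axes[axis]
    other_axes = tuple(other_axes)
    created = out is None
    if created:
        if identity is None:
            # The ufunc has no identity (e.g. np.maximum), so seed each slot by
            # reducing that group's own data taken at index 0 of the other axes.
            assert np.all(np.isin(np.arange(minlength), groups)), "No valid identity for unassigned groups"
            s = [slice(None)] * X.ndim
            for ax in other_axes:
                s[ax] = 0
            out = np.array([uf.reduce(X[tuple(s)][groups == g]) for g in range(minlength)])
        else:
            out = np.full((minlength,), identity, dtype=X.dtype)
    # Reduce over the non-grouped axes, then accumulate unbuffered into each group's slot.
    uf.at(out, groups, uf.reduce(X, other_axes))
    # When the caller supplies `out`, it is updated in place and nothing is returned.
    if created:
        return out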
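
The function relies on `uf.at` rather than plain fancy-index assignment because group labels repeat. The sketch below is an illustration added here, not part of the original answer; it assumes the per-row sums of the question's `X` and shows that only the unbuffered `np.add.at` call accumulates every occurrence of a repeated index.

```python
import numpy as np

groups = np.array([0, 0, 0, 1, 1])
row_sums = np.array([1, 5, 9, 13, 17])  # X.sum(axis=1) for the X in the question

# Buffered fancy-index update: a repeated index is written only once,
# so contributions from the other rows of the same group are lost.
buffered = np.zeros(2, dtype=row_sums.dtype)
buffered[groups] += row_sums

# Unbuffered ufunc.at: every occurrence of an index is accumulated.
unbuffered = np.zeros(2, dtype=row_sums.dtype)
np.add.at(unbuffered, groups, row_sums)

print(unbuffered)                            # [15 30]
print(np.array_equal(buffered, unbuffered))  # False
```

The same reasoning applies to any binary ufunc passed as `uf`, since each one exposes its own `.at` method.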

>>> groupby_np(X, groups)
array([15, 30])

>>> groupby_np(X, groups, uf=np.multiply)
array([   0, 3024])

>>> groupby_np(X, groups, uf=np.maximum)
array([5, 9])

>>> groupby_np(X, groups, uf=np.minimum)
array([0, 6])
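
The ufunc-based function covers reductions such as sum, product, min and max, but not arbitrary callables such as a median. For those, one possible approach is to sort the rows by group once and split at the group boundaries. The helper below, `groupby_apply`, is a hypothetical name introduced for illustration and is not from the original answer. It still loops over the groups in Python, but it avoids building a boolean mask over all rows for every group, and it assumes `groups` is non-empty. Whether it is actually faster depends on the data and is worth benchmarking.

```python
import numpy as np

def groupby_apply(X, groups, func):
    """Apply `func` to the rows of each group (illustrative helper, assumes non-empty input)."""
    # A stable sort brings rows with the same label together, keeping their original order.
    order = np.argsort(groups, kind="stable")
    sorted_groups = groups[order]
    # Positions where the sorted label changes mark the start of a new group.
    boundaries = np.flatnonzero(np.diff(sorted_groups)) + 1
    chunks = np.split(X[order], boundaries)
    keys = sorted_groups[np.r_[0, boundaries]]
    return keys, [func(chunk) for chunk in chunks]

X = np.arange(10).reshape(5, 2)
groups = np.array([0, 0, 0, 1, 1])

keys, sums = groupby_apply(X, groups, np.sum)           # sums are 15 and 30
keys, medians = groupby_apply(X, groups, np.median)     # medians are 2.5 and 7.5
```

Because each chunk is an ordinary array, `func` can be any callable that accepts one, which is the "flexible" behavior the question asks about.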
