Pandas aggregating average while excluding current row

Problem description

How can I compute, for each row, the average of b within its group a while excluding the current row? The target result is shown in column c.

a b   c

1 1   0.5   # (avg of 0 & 1, excluding 1)
1 1   0.5   # (avg of 0 & 1, excluding 1)
1 0   1     # (avg of 1 & 1, excluding 0)

2 1   0.5   # (avg of 0 & 1, excluding 1)
2 0   1     # (avg of 1 & 1, excluding 0)
2 1   0.5   # (avg of 0 & 1, excluding 1)

3 1   0.5   # (avg of 0 & 1, excluding 1)
3 0   1     # (avg of 1 & 1, excluding 0)
3 1   0.5   # (avg of 0 & 1, excluding 1)

Data dump:

import pandas as pd
data = pd.DataFrame([[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], [2, 1, 0.5], [2, 0, 1], 
                     [2, 1, 0.5], [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]],
                     columns=['a', 'b', 'c'])

Recommended answer

Suppose a group has values x_1, ..., x_n.

The average of the entire group is

m = (x_1 + ... + x_n)/n

The sum of the group without x_i is

(m*n - x_i)

The average of the group without x_i is

(m*n - x_i)/(n-1)
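As a quick sanity check of the formula, the sketch below compares it against a brute-force leave-one-out mean for one group from the question (b = 1, 1, 0). The variable names are illustrative only and are not part of the original answer.

```python
import numpy as np

# Values of b for the rows where a == 1 in the question's data
x = np.array([1, 1, 0], dtype=float)
n = len(x)
m = x.mean()

# Leave-one-out mean from the closed-form expression (m*n - x_i)/(n - 1)
via_formula = (m * n - x) / (n - 1)

# Brute force: drop element i and average the remaining values
brute_force = np.array([np.delete(x, i).mean() for i in range(n)])

print(via_formula)
print(brute_force)
```

Both arrays should agree up to floating-point rounding.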

Therefore, you could compute the desired column of values with

import pandas as pd
df = pd.DataFrame([[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], [2, 1, 0.5], [2, 0, 1], 
                     [2, 1, 0.5], [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]],
                     columns=['a', 'b', 'c'])

grouped = df.groupby(['a'])
n = grouped['b'].transform('count')
mean = grouped['b'].transform('mean')
df['result'] = (mean*n - df['b'])/(n-1)

which yields

In [32]: df
Out[32]: 
   a  b    c  result
0  1  1  0.5     0.5
1  1  1  0.5     0.5
2  1  0  1.0     1.0
3  2  1  0.5     0.5
4  2  0  1.0     1.0
5  2  1  0.5     0.5
6  3  1  0.5     0.5
7  3  0  1.0     1.0
8  3  1  0.5     0.5

In [33]: assert df['result'].equals(df['c'])
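A variation on the same idea, offered as a sketch and not as part of the original answer: use the group sum directly instead of mean*n, and make the single-row case explicit. A group with only one row has no other rows to average, so the result there is NaN. The helper name `leave_one_out_mean` and the `demo` frame are hypothetical.

```python
import numpy as np
import pandas as pd

def leave_one_out_mean(df, group_col, value_col):
    """Mean of value_col within each group, excluding the current row."""
    grouped = df.groupby(group_col)[value_col]
    total = grouped.transform('sum')   # group sum, aligned to the rows
    n = grouped.transform('count')     # group size, aligned to the rows
    # Rows in single-member groups have nothing left to average: use NaN
    denom = (n - 1).where(n > 1)
    return (total - df[value_col]) / denom

# Hypothetical data: group 1 has three rows, group 2 has a single row
demo = pd.DataFrame({'a': [1, 1, 1, 2], 'b': [1, 1, 0, 5]})
demo['result'] = leave_one_out_mean(demo, 'a', 'b')
print(demo)
```

Note that `count` ignores missing values in b, so this sketch assumes column b has no NaNs.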


Per the comments below, in the OP's actual use case the DataFrame's a column contains strings:

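import numpy as np

def make_random_str_array(letters, strlen, size):
    # Draw size*strlen single characters, then reinterpret every run of
    # strlen characters as one fixed-width string. Under Python 3 the drawn
    # array has dtype '<U1', so the view uses a unicode dtype, not '|S'.
    return (np.random.choice(list(letters), size*strlen)
            .view('<U{}'.format(strlen)))

N = 3*10**6
df = pd.DataFrame({'a':make_random_str_array(letters='ABCD', strlen=10, size=N),
                   'b':np.random.randint(10, size=N)})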

so that there are about a million unique values in df['a'] out of 3 million total:

In [87]: uniq, key = np.unique(df['a'], return_inverse=True)
In [88]: len(uniq)
Out[88]: 988337

In [89]: len(df)
Out[89]: 3000000

In this case the calculation above takes (on my machine) about 11 seconds:

In [86]: %%timeit
   ....: grouped = df.groupby(['a'])
   ....: n = grouped['b'].transform('count')
   ....: mean = grouped['b'].transform('mean')
   ....: df['result'] = (mean*n - df['b'])/(n-1)
   ....: 
1 loops, best of 3: 10.5 s per loop

Pandas converts all string-valued columns to object dtype. But we could convert the DataFrame column to a NumPy array with a fixed-width dtype, and then group according to those values.

Here is a benchmark showing that if we convert the Series with object dtype to a NumPy array with a fixed-width string dtype, the calculation takes less than 2 seconds:

In [97]: %%timeit
   ....: grouped = df.groupby(df['a'].values.astype('|S4'))
   ....: n = grouped['b'].transform('count')
   ....: mean = grouped['b'].transform('mean')
   ....: df['result'] = (mean*n - df['b'])/(n-1)
   ....: 
1 loops, best of 3: 1.39 s per loop

Beware that you need to know the maximum length of the strings in df['a'] to choose the appropriate fixed-width dtype. |S4 is appropriate only when no string is longer than 4 characters; strings generated with strlen=10, as in make_random_str_array above, would need |S10. If you use |Sn for some integer n that is smaller than the longest string, the longer strings are silently truncated, with no error or warning. This could lead to values being grouped together that should not be. Thus, the onus is on you to choose the correct fixed-width dtype.
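To make the truncation risk concrete, here is a small sketch with made-up keys (they do not come from the original post). Two distinct 10-character keys share their first four characters, so casting to |S4 collapses them into one value.

```python
import numpy as np

# Hypothetical keys: the first two differ only after the 4th character
keys = np.array(['ABCDAAAAAA', 'ABCDBBBBBB', 'DCBA'], dtype=object)

too_narrow = keys.astype('|S4')    # silently truncates to 4 bytes
wide_enough = keys.astype('|S10')  # keeps every key intact

print(too_narrow)
print(len(np.unique(too_narrow)), len(np.unique(wide_enough)))
```

With the narrow dtype the three keys reduce to two distinct values, which is exactly how unrelated rows could end up in the same group.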

You could use

dtype = '|S{}'.format(df['a'].str.len().max())
grouped = df.groupby(df['a'].values.astype(dtype))

to ensure the conversion uses the correct dtype.
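If choosing a fixed-width dtype feels fragile, one alternative (a different technique from the one in the answer, sketched here without any timing claims) is to map the strings to integer codes with pd.factorize and compute the group sums and counts with np.bincount. No truncation can occur because the strings are never cast. The small frame below is illustrative, and the sketch assumes column a has no missing values, since pd.factorize encodes those as -1 and np.bincount rejects negative codes.

```python
import numpy as np
import pandas as pd

# Illustrative data with string group keys
df = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'y'],
                   'b': [1, 1, 0, 1, 0, 1]})

# Integer code per row (0..k-1) identifying its group
codes, uniques = pd.factorize(df['a'])
b = df['b'].to_numpy(dtype=float)

sums = np.bincount(codes, weights=b)  # per-group sum of b
counts = np.bincount(codes)           # per-group row count
n = counts[codes]                     # group size broadcast back to rows

# Single-row groups evaluate 0/0 and yield NaN; silence the NumPy warning
with np.errstate(divide='ignore', invalid='ignore'):
    df['result'] = (sums[codes] - b) / (n - 1)

print(df)
```

The inverse index returned by np.unique(df['a'], return_inverse=True) could play the same role as `codes` here.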
