pandas groupby and rolling_apply ignoring NaNs

Problem description

I have a pandas dataframe and I want to calculate the rolling mean of a column (after a groupby clause). However, I want to exclude NaNs.

For instance, if the groupby returns [2, NaN, 1], the result should be 1.5 while currently it returns NaN.
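
In other words, the window mean should behave like np.nanmean. A quick sanity check of the target behaviour:

import numpy as np

np.nanmean([2, np.nan, 1])  # 1.5 -- the NaN is simply ignored
np.mean([2, np.nan, 1])     # nan -- what the rolling mean currently gives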

I've tried the following but it doesn't seem to work:

df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 3,  lambda x: np.mean([i for i in x if i is not np.nan and i!='NaN']))

If I even try this:

df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 3,  lambda x: 1)

I'm getting NaN in the output so it must be something to do with how pandas works in the background.
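
Presumably the output is masked wherever the window contains fewer non-NaN values than min_periods, which defaults to the window size. As a hedged sketch (untested on 0.18.1), passing min_periods=1 explicitly should make the constant appear:

df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 3, lambda x: 1, min_periods=1)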

Any ideas?

Here is a code sample showing what I'm trying to do:

import pandas as pd
import numpy as np

df = pd.DataFrame({'var1' : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'], 'value' : [1, 2, 3, np.nan, 2, 3, 4, 1] })
print df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 2,  lambda x: np.mean([i for i in x if i is not np.nan and i!='NaN']))

The result is:

0    NaN
1    NaN
2    2.0
3    NaN
4    2.5
5    NaN
6    3.0
7    2.0

What I'd like to get is:

0    NaN
1    NaN
2    2.0
3    2.0
4    2.5
5    3.0
6    3.0
7    2.0

Recommended answer

As always in pandas, sticking to vectorized methods (i.e. avoiding apply) is essential for performance and scalability.

The operation you want to do is a little fiddly as rolling operations on groupby objects are not NaN-aware at present (version 0.18.1). As such, we'll need a few short lines of code:

g1 = df.groupby(['var1'])['value']              # group values  
g2 = df.fillna(0).groupby(['var1'])['value']    # fillna, then group values

s = g2.rolling(2).sum() / g1.rolling(2).count() # the actual computation

s.reset_index(level=0, drop=True).sort_index()  # drop/sort index

The idea is to sum the values in each window (using sum), count the non-NaN values (using count) and then divide to find the mean. This code gives the following output, which matches your desired output:

0    NaN
1    NaN
2    2.0
3    2.0
4    2.5
5    3.0
6    3.0
7    2.0
Name: value, dtype: float64

Testing this on a larger DataFrame (around 100,000 rows), the run-time was under 100ms, significantly faster than any apply-based method I tried.

It may be worth testing the different approaches on your actual data as timings may be influenced by other factors such as the number of groups. It's fairly certain that vectorized computations will win out, though.

The approach shown above works well for simple calculations, such as the rolling mean. It will work for more complicated calculations (such as rolling standard deviation), although the implementation is more involved.

The general idea is to take each simple routine that is fast in pandas (e.g. sum), fill any null values with an identity element (e.g. 0), use groupby and perform the rolling operation (e.g. .rolling(2).sum()), and then combine the output with the output(s) of the other operations.
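
For instance, here is a minimal sketch of the same recipe applied to a NaN-aware grouped rolling max (illustrative, not from the original answer): the identity element for max is -inf rather than 0, and any window containing no data at all is masked back to NaN afterwards.

# -inf is the identity for max, so filled NaNs can never win the comparison
g_max = df['value'].fillna(-np.inf).groupby(df['var1'])
n = df.groupby('var1')['value'].rolling(2).count()   # non-NaN values per window
rolling_max = g_max.rolling(2).max().where(n > 0)    # all-NaN windows stay NaN

As with the mean, the result can be flattened with .reset_index(level=0, drop=True).sort_index().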

For example, to implement groupby NaN-aware rolling variance (of which standard deviation is the square-root) we must find "the mean of the squares minus the square of the mean". Here's a sketch of what this could look like:

def rolling_nanvar(df, window):
    """
    Group df by 'var1' values and then calculate rolling variance,
    adjusting for the number of NaN values in the window.

    Note: user may wish to edit this function to control degrees of
    freedom (n), depending on their overall aim.
    """
    g1 = df.groupby(['var1'])['value']
    g2 = df.fillna(0).groupby(['var1'])['value']
    # fill missing values with 0, square values and groupby
    g3 = df['value'].fillna(0).pow(2).groupby(df['var1'])

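    # number of non-NaN values in each rolling window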
    n = g1.rolling(window).count()

    mean_of_squares = g3.rolling(window).sum() / n
    square_of_mean = (g2.rolling(window).sum() / n)**2
    variance = mean_of_squares - square_of_mean
    return variance.reset_index(level=0, drop=True).sort_index()
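
For reference, running this sketch on the sample df above should give output along these lines (population variance, ddof=0, i.e. matching np.nanvar):

>>> rolling_nanvar(df, 2)
0     NaN
1     NaN
2    1.00
3    0.00
4    0.25
5    0.00
6    1.00
7    1.00
Name: value, dtype: float64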

Note that this function may not be numerically stable (squaring could lead to overflow). pandas uses Welford's algorithm internally to mitigate this issue.

Anyway, this function, although it uses several operations, is still very fast. Here's a comparison with the more concise apply-based method suggested by Yakym Pirozhenko:

>>> df2 = pd.concat([df]*10000, ignore_index=True) # 80000 rows
>>> %timeit df2.groupby('var1')['value'].apply(\
         lambda gp: gp.rolling(7, min_periods=1).apply(np.nanvar))
1 loops, best of 3: 11 s per loop

>>> %timeit rolling_nanvar(df2, 7)
10 loops, best of 3: 110 ms per loop

Vectorization is 100 times faster in this case. Of course, depending on how much data you have, you may wish to stick with apply, since it buys you generality and brevity at the expense of performance.
