Applying custom function while grouping returns NaN

Question

Given a dict, performance, storing Series of this kind:

2015-02-28           NaN
2015-03-02    100.000000
2015-03-03     98.997117
2015-03-04     98.909215
2015-03-05     99.909979
2015-03-06    100.161486
2015-03-09    100.502772
2015-03-10    101.685314
2015-03-11    102.518433
2015-03-12    102.427237
2015-03-13    103.424257
2015-03-16    102.669184
2015-03-17    102.181841
2015-03-18    102.436339
2015-03-19    102.672482
2015-03-20    102.238386
2015-03-23    101.460082
...

I want to group them by month, but for each month's data pick only the first value that is not np.nan:

for perf in performance:
    performance[perf] = performance[perf].groupby(performance[perf].index.month).apply(return_first)


def return_first(array_like):
    # Return data from 1st of month, or first value that is not np.nan
    for i in range(len(array_like)):
        if np.isnan(array_like[i]):
            continue
        else:
            return(array_like[i])

This returns NaN values:

2015-02-28   NaN
2015-03-02   NaN
2015-03-03   NaN
2015-03-04   NaN
2015-03-05   NaN
2015-03-06   NaN
2015-03-09   NaN
2015-03-10   NaN
2015-03-11   NaN
2015-03-12   NaN
2015-03-13   NaN
2015-03-16   NaN
2015-03-17   NaN
2015-03-18   NaN
2015-03-19   NaN
2015-03-20   NaN
2015-03-23   NaN
...

When it should return:

2015-03-02   100   
...

I don't suspect my index, which seems to be a perfectly fine pd.DatetimeIndex:

DatetimeIndex(['2015-02-28', '2015-03-02', '2015-03-03', '2015-03-04',
           '2015-03-05', '2015-03-06', '2015-03-09', '2015-03-10',
           '2015-03-11', '2015-03-12',
           ...
           '2016-02-16', '2016-02-17', '2016-02-18', '2016-02-19',
           '2016-02-22', '2016-02-23', '2016-02-24', '2016-02-25',
           '2016-02-26', '2016-02-29'],
          dtype='datetime64[ns]', length=265, freq=None)

Where did I go wrong?

Recommended answer

If each month has at least one non-NaN value, use first_valid_index:

print (df.b.groupby(df.index.month).apply(lambda x: x[x.first_valid_index()]))

A more general solution, which returns NaN if all values in a month are NaN:

def f(x):
    if x.first_valid_index() is None:
        return np.nan
    else:
        return x[x.first_valid_index()]

print (df.b.groupby(df.index.month).apply(f))

2      NaN
3    100.0
Name: b, dtype: float64
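
As a possible alternative (not part of the original answer), pandas' built-in GroupBy.first() returns the first non-null entry of each group by default, so it may make the helper f unnecessary. A minimal sketch, assuming a small made-up Series s that stands in for one entry of the dict:

```python
import numpy as np
import pandas as pd

# Illustrative data only: February is all NaN, April starts with a NaN.
s = pd.Series(
    [np.nan, 100.0, 98.997117, np.nan, 100.502772],
    index=pd.to_datetime(
        ["2015-02-25", "2015-03-02", "2015-03-03", "2015-04-05", "2015-04-09"]
    ),
    name="b",
)

# first() skips NaN within each group; a group with no valid value yields NaN.
first_valid = s.groupby(s.index.month).first()
print(first_valid)
```

Here month 2 comes out as NaN, month 3 as 100.0, and month 4 as 100.502772, which matches what f produces on the same data.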

If you want to group by year and month, use to_period:

print (df.b.groupby(df.index.to_period('M')).apply(f))
2015-02      NaN
2015-03    100.0
Freq: M, Name: b, dtype: float64
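
The difference matters when the index spans more than one year, as the index in the question does (2015-02 to 2016-02): index.month puts February 2015 and February 2016 in the same group, while to_period('M') keeps them apart. A small sketch with made-up values (the Series s and the helper name first_valid are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Illustrative data spanning two different Februaries.
idx = pd.to_datetime(["2015-02-27", "2015-03-02", "2016-02-01", "2016-02-02"])
s = pd.Series([np.nan, 100.0, 110.0, 111.0], index=idx, name="b")

def first_valid(x):
    # Same logic as f above: first non-NaN value, or NaN if there is none.
    i = x.first_valid_index()
    return np.nan if i is None else x[i]

# Month number only: both Februaries fall into group 2.
by_month = s.groupby(s.index.month).apply(first_valid)

# Year and month: 2015-02 and 2016-02 are separate groups.
by_period = s.groupby(s.index.to_period("M")).apply(first_valid)

print(by_month)
print(by_period)
```

With month numbers alone, group 2 reports 110.0 (taken from 2016) even though February 2015 has no valid value. With periods, 2015-02 stays NaN and 2016-02 gets 110.0.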

Sample:

import pandas as pd
import numpy as np

df = pd.DataFrame({'b': pd.Series({ pd.Timestamp('2015-07-19 00:00:00'): 102.67248199999999,  pd.Timestamp('2015-04-05 00:00:00'):  np.nan,  pd.Timestamp('2015-02-25 00:00:00'):  np.nan,  pd.Timestamp('2015-04-09 00:00:00'): 100.50277199999999,  pd.Timestamp('2015-06-18 00:00:00'): 102.436339,  pd.Timestamp('2015-06-16 00:00:00'): 102.669184,  pd.Timestamp('2015-04-10 00:00:00'): 101.68531400000001,  pd.Timestamp('2015-05-12 00:00:00'): 102.42723700000001,  pd.Timestamp('2015-07-20 00:00:00'): 102.23838600000001,  pd.Timestamp('2015-06-17 00:00:00'):  np.nan,  pd.Timestamp('2015-08-23 00:00:00'): 101.460082,  pd.Timestamp('2015-03-03 00:00:00'): 98.997117000000003,  pd.Timestamp('2015-03-02 00:00:00'): 100.0,  pd.Timestamp('2015-05-11 00:00:00'): 102.518433,  pd.Timestamp('2015-03-04 00:00:00'): 98.909215000000003, pd.Timestamp('2015-05-13 00:00:00'): 103.424257,  pd.Timestamp('2015-04-06 00:00:00'):  np.nan})})

print (df)

                     b
2015-02-25         NaN
2015-03-02  100.000000
2015-03-03   98.997117
2015-03-04   98.909215
2015-04-05         NaN
2015-04-06         NaN
2015-04-09  100.502772
2015-04-10  101.685314
2015-05-11  102.518433
2015-05-12  102.427237
2015-05-13  103.424257
2015-06-16  102.669184
2015-06-17         NaN
2015-06-18  102.436339
2015-07-19  102.672482
2015-07-20  102.238386
2015-08-23  101.460082

def f(x):
    if x.first_valid_index() is None:
        return np.nan
    else:
        return x[x.first_valid_index()]

print (df.b.groupby(df.index.to_period('M')).apply(f))
2015-02           NaN
2015-03    100.000000
2015-04    100.502772
2015-05    102.518433
2015-06    102.669184
2015-07    102.672482
2015-08    101.460082
Freq: M, Name: b, dtype: float64
