Dealing with None values when using Pandas Groupby and Apply with a Function

Question

I have a pandas DataFrame with a letter column and two date columns. For each row, I want the number of business days between the two dates of the previous row, using shift(), where "previous" means the previous row with the same Letter value (hence the .groupby()). I was doing this with .apply(), and it worked until I passed in data where one of the dates was missing. I moved everything into a function so I could handle the missing value with a try/except clause, but now the result is NaN for every row. It looks as though the None date affects every call of the function, whereas I would expect it to matter only when the Letter from the .groupby() is A.

import pandas as pd
from datetime import datetime
import numpy as np

def business_days(x):
    try:
        return pd.DataFrame(np.busday_count(x['First Date'].tolist(), x['Last Date'].tolist())).shift().reset_index(drop=True)
    except ValueError:
        return None

df = pd.DataFrame(data=[['A', datetime(2016, 1, 7), None],
                        ['A', datetime(2016, 3, 1), datetime(2016, 3, 8)],
                        ['B', datetime(2016, 5, 1), datetime(2016, 5, 10)],
                        ['B', datetime(2016, 6, 5), datetime(2016, 6, 7)]],
                  columns=['Letter', 'First Date', 'Last Date'])

df['First Date'] = df['First Date'].apply(lambda x: x.to_datetime().date())
df['Last Date'] = df['Last Date'].apply(lambda x: x.to_datetime().date())

df['Gap'] = df.groupby('Letter').apply(business_days)

print(df)

Actual output:

  Letter  First Date   Last Date  Gap
0      A  2016-01-07         NaT  NaN
1      A  2016-03-01  2016-03-08  NaN
2      B  2016-05-01  2016-05-10  NaN
3      B  2016-06-05  2016-06-07  NaN

Desired output:

  Letter   First Day    Last Day   Gap
0      A  2016-01-07         NAT  NAN
1      A  2016-03-01  2016-03-08  NAN
2      B  2016-05-01  2016-05-10  NAN
3      B  2016-06-05  2016-06-07  7

Recommended answer

    • Ignoring the NaTs for the moment, note that the np.busday_count calculation can be done on whole columns of df before applying groupby. This saves time because it replaces many calls to np.busday_count (one per group) with a single call. One function call on a large array is generally faster than many function calls on small arrays.

      To handle the NaTs, you could use pd.notnull to identify the rows in which both dates are present, and mask First Date and Last Date so that only valid dates are sent to np.busday_count. You can then fill in NaN for the rows where either date was NaT.

      After calculating all the business day counts, all that remains is to group by Letter and shift the values down by one within each group. That can be done with groupby/transform('shift').

      import datetime as DT
      import numpy as np
      import pandas as pd
      
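      def business_days(start, end):
          # True only for rows where both dates are present (not NaT/None)
          mask = pd.notnull(start) & pd.notnull(end)
          # Truncate to day precision for np.busday_count and keep valid rows only
          start = start.values.astype('datetime64[D]')[mask]
          end = end.values.astype('datetime64[D]')[mask]
          # Float array so that rows with a missing date can hold NaN
          result = np.empty(len(mask), dtype=float)
          result[mask] = np.busday_count(start, end)
          result[~mask] = np.nan
          return result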
      
      df = pd.DataFrame(data=[['A', DT.datetime(2016, 1, 7), None],
                              ['A', DT.datetime(2016, 3, 1), DT.datetime(2016, 3, 8)],
                              ['B', DT.datetime(2016, 5, 1), DT.datetime(2016, 5, 10)],
                              ['B', DT.datetime(2016, 6, 5), DT.datetime(2016, 6, 7)]],
                        columns=['Letter', 'First Date', 'Last Date'])
      
      df['Gap'] = business_days(df['First Date'], df['Last Date'])
      print(df)
      #   Letter First Date  Last Date  Gap
      # 0      A 2016-01-07        NaT  NaN
      # 1      A 2016-03-01 2016-03-08  5.0
      # 2      B 2016-05-01 2016-05-10  6.0
      # 3      B 2016-06-05 2016-06-07  1.0
      
      df['Gap'] = df.groupby('Letter')['Gap'].transform('shift')
      print(df)
      

      prints

        Letter First Date  Last Date  Gap
      0      A 2016-01-07        NaT  NaN
      1      A 2016-03-01 2016-03-08  NaN
      2      B 2016-05-01 2016-05-10  NaN
      3      B 2016-06-05 2016-06-07  6.0
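
As a compact cross-check of the approach above, here is a minimal sketch that rebuilds the same four rows, computes the masked business-day counts, and compares `transform('shift')` with calling `.shift()` directly on the grouped Series. The variable names (`valid`, `gap`, `via_transform`, `via_shift`) are made up for this illustration and do not appear in the answer; the direct `.shift()` call is offered as an alternative spelling, not as the answer's own method.

```python
import numpy as np
import pandas as pd

# Same data as in the question; the first 'Last Date' is missing (NaT)
df = pd.DataFrame({
    'Letter': ['A', 'A', 'B', 'B'],
    'First Date': pd.to_datetime(['2016-01-07', '2016-03-01',
                                  '2016-05-01', '2016-06-05']),
    'Last Date': pd.to_datetime([None, '2016-03-08',
                                 '2016-05-10', '2016-06-07']),
})

# Rows where both dates are present
valid = df['First Date'].notna() & df['Last Date'].notna()

# Start with NaN everywhere, then fill in counts for the valid rows only
gap = pd.Series(np.nan, index=df.index)
gap.loc[valid] = np.busday_count(
    df.loc[valid, 'First Date'].values.astype('datetime64[D]'),
    df.loc[valid, 'Last Date'].values.astype('datetime64[D]'),
)

# Two spellings of "shift down by one within each Letter group"
via_transform = gap.groupby(df['Letter']).transform('shift')
via_shift = gap.groupby(df['Letter']).shift()

df['Gap'] = via_shift
print(df)
```

Under these assumptions both spellings should give the same column, with a value only in the last row, because that is the only row whose predecessor in the same group has two valid dates.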
      
