Counting cumulative occurrences of values based on date window in Pandas


Problem description

I have a DataFrame (df) that looks like the following:

+----------+----+
| dd_mm_yy | id |
+----------+----+
| 01-03-17 | A  |
| 01-03-17 | B  |
| 01-03-17 | C  |
| 01-05-17 | B  |
| 01-05-17 | D  |
| 01-07-17 | A  |
| 01-07-17 | D  |
| 01-08-17 | C  |
| 01-09-17 | B  |
| 01-09-17 | B  |
+----------+----+

This is the end result I would like to compute:

+----------+----+-----------+
| dd_mm_yy | id | cum_count |
+----------+----+-----------+
| 01-03-17 | A  |         1 |
| 01-03-17 | B  |         1 |
| 01-03-17 | C  |         1 |
| 01-05-17 | B  |         2 |
| 01-05-17 | D  |         1 |
| 01-07-17 | A  |         2 |
| 01-07-17 | D  |         2 |
| 01-08-17 | C  |         1 |
| 01-09-17 | B  |         2 |
| 01-09-17 | B  |         3 |
+----------+----+-----------+

Logic

To calculate the cumulative occurrences of values in id, but only within a specified time window, for example 4 months; i.e. in every 5th month the counter resets to one.

To get the cumulative occurrences without any time window, we can use df.groupby('id').cumcount() + 1
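As a quick illustration of why the plain cumulative count is not enough, the sketch below rebuilds the question's table (the column name plain_count is made up for this example) and applies cumcount. The counter never resets, so the last C and the last two B rows come out higher than the desired result.

```python
import pandas as pd

# Data copied from the question's table; dates are kept as dd-mm-yy strings here
# because the plain cumulative count does not look at them at all.
df = pd.DataFrame({
    "dd_mm_yy": ["01-03-17", "01-03-17", "01-03-17", "01-05-17", "01-05-17",
                 "01-07-17", "01-07-17", "01-08-17", "01-09-17", "01-09-17"],
    "id": list("ABCBDADCBB"),
})

# Running count per id, starting at 1, with no time window applied.
# "plain_count" is a hypothetical column name used only for this example.
df["plain_count"] = df.groupby("id").cumcount() + 1
print(df)
```

Compared with the desired cum_count column, the rows for C on 01-08-17 and B on 01-09-17 differ, which is exactly where the 4-month window has to drop older occurrences.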

Focusing on id = B, we see that the 2nd occurrence of B is after 2 months, so cum_count = 2. The next occurrence of B is at 01-09-17; looking back 4 months we find only one other occurrence, so cum_count = 2, etc.

Recommended answer

My approach is to call a helper function from df.groupby('id').transform. I feel this is more complicated and slower than it could be, but it seems to work.

# test data

    date    id  cum_count_desired
2017-03-01  A   1
2017-03-01  B   1
2017-03-01  C   1
2017-05-01  B   2
2017-05-01  D   1
2017-07-01  A   2
2017-07-01  D   2
2017-08-01  C   1
2017-09-01  B   2
2017-09-01  B   3
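To make the steps that follow reproducible, the test data above could be built like this. This is only a sketch: the answer does not show how df was created, so the construction below is an assumption, with the dates left as strings for the preprocessing step to parse.

```python
import pandas as pd

# Test data copied from the table above. How the answer actually built df is
# not shown, so this construction is an assumption.
df = pd.DataFrame({
    "date": ["2017-03-01", "2017-03-01", "2017-03-01", "2017-05-01", "2017-05-01",
             "2017-07-01", "2017-07-01", "2017-08-01", "2017-09-01", "2017-09-01"],
    "id": list("ABCBDADCBB"),
    "cum_count_desired": [1, 1, 1, 2, 1, 2, 2, 1, 2, 3],
})
print(df)
```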

# preprocessing

import pandas as pd

# Parse the dates and use them as the index so rows can be sliced by date
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Encode the ID strings to numbers to have a column
# to work with after grouping by ID
df['id_code'] = pd.factorize(df['id'])[0]

# solution

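def cumcounter(x):
    # x holds one ID's rows, indexed by date in ascending order.
    # Count the rows from 4 calendar months before each date up to that date.
    y = pd.Series(
        [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index],
        index=x.index,
    )
    # Same-day duplicates are all counted at once; subtract the excess.
    gr = x.groupby('date')
    adjust = gr.rank(method='first') - gr.transform('size')
    return (y + adjust).astype(int)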

df['cum_count'] = df.groupby('id')['id_code'].transform(cumcounter)

# output

df[['id', 'id_code', 'cum_count_desired', 'cum_count']]

           id  id_code  cum_count_desired  cum_count
date                                                
2017-03-01  A        0                  1          1
2017-03-01  B        1                  1          1
2017-03-01  C        2                  1          1
2017-05-01  B        1                  2          2
2017-05-01  D        3                  1          1
2017-07-01  A        0                  2          2
2017-07-01  D        3                  2          2
2017-08-01  C        2                  1          1
2017-09-01  B        1                  2          2
2017-09-01  B        1                  3          3
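Since the answer itself suspects the per-row slicing is slower than necessary, here is one possible alternative, offered as a sketch rather than as part of the original answer. It swaps the slicing for numpy.searchsorted: within each ID, a row's windowed count is its 1-based position minus the number of earlier rows that fall before the window start. The helper name windowed_cumcount is hypothetical, and the sketch assumes a unique row index and dates already sorted within each ID.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-03-01", "2017-03-01", "2017-03-01", "2017-05-01", "2017-05-01",
        "2017-07-01", "2017-07-01", "2017-08-01", "2017-09-01", "2017-09-01",
    ]),
    "id": list("ABCBDADCBB"),
})

def windowed_cumcount(frame, months=4):
    """Hypothetical helper: per-id running count limited to a trailing window.

    Assumes frame has a unique index and is sorted by date within each id.
    """
    out = pd.Series(0, index=frame.index, dtype="int64")
    for _, dates in frame.groupby("id", sort=False)["date"]:
        idx = pd.DatetimeIndex(dates)
        lower = idx - pd.DateOffset(months=months)  # inclusive window start
        # Number of rows in this group that are older than each window start.
        older = np.searchsorted(idx.values, lower.values, side="left")
        # 1-based position in the group minus the rows that fell out of the window.
        out.loc[dates.index] = np.arange(1, len(idx) + 1) - older
    return out

df["cum_count"] = windowed_cumcount(df)
print(df)
```

Because each row is counted by its own position, same-day duplicates need no separate correction here. Using side="left" keeps the left endpoint of the window, matching the inclusive behaviour of the slicing approach.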

The need for adjust

If the same ID occurs multiple times on the same day, the slicing approach that I use overcounts each of the same-day rows, because the date-based slice grabs all of the same-day values at once when the list comprehension reaches a date on which the ID appears more than once. Fix:

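  1. Group the current DataFrame by date.
  2. Rank each row within its date group.
  3. Subtract from these ranks the total number of rows in each date group. This produces a date-indexed Series of ascending negative integers, ending at 0.
  4. Add these non-positive integer adjustments to y.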

This only affects one row in the given test data, the second-last row, because B appears twice on the same day.
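The adjustment can be traced on the rows for id = B alone. The sketch below rebuilds that group by hand (the constant values stand in for the id_code column) and shows the raw slice counts, the adjustment, and their sum.

```python
import pandas as pd

# The four B rows from the test data, indexed by date; the values are a
# stand-in for id_code, which is constant within one ID.
dates = pd.DatetimeIndex(
    ["2017-03-01", "2017-05-01", "2017-09-01", "2017-09-01"], name="date"
)
x = pd.Series([1, 1, 1, 1], index=dates)

# Raw counts from the date slice: both 2017-09-01 rows see all same-day rows.
raw = pd.Series(
    [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index],
    index=x.index,
)

# Steps 1-3: rank within each date group, then subtract the group size.
gr = x.groupby(level="date")
adjust = (gr.rank(method="first") - gr.transform("size")).astype(int)

# Step 4: add the non-positive adjustment to the raw counts.
corrected = raw + adjust
print(pd.DataFrame({"raw": raw, "adjust": adjust, "corrected": corrected}))
```

Only the first of the two 2017-09-01 rows receives an adjustment of -1, which turns the raw counts 3, 3 into 2, 3.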

To count rows as old as or newer than 4 calendar months ago, i.e., to include the left endpoint of the 4-month time interval, leave this line unchanged:

y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]

To count rows strictly newer than 4 calendar months ago, i.e., to exclude the left endpoint of the 4-month time interval, use this instead:

y = [x.loc[d - pd.DateOffset(months=4, days=-1):d].count() for d in x.index]
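A small check of the two variants, using a made-up pair of dates that lie exactly 4 calendar months apart, may help to see the difference at the left endpoint:

```python
import pandas as pd

# Two occurrences exactly 4 calendar months apart (illustrative data only).
dates = pd.DatetimeIndex(["2017-05-01", "2017-09-01"], name="date")
x = pd.Series([1, 1], index=dates)
d = x.index[-1]  # 2017-09-01

# Window starts on 2017-05-01, so the older row is included.
inclusive = x.loc[d - pd.DateOffset(months=4):d].count()

# Window starts on 2017-05-02, so the older row is excluded.
exclusive = x.loc[d - pd.DateOffset(months=4, days=-1):d].count()

print(inclusive, exclusive)
```

With days=-1 the window start moves one day forward, so this variant excludes the endpoint at day granularity; data with intraday timestamps would need a finer offset.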
