Grouping data by id, var1 into consecutive dates in Python using pandas


Question

I have some data that looks like this:

import numpy as np
import pandas as pd

df_raw_dates = pd.DataFrame({"id": [102, 102, 102, 103, 103, 103, 104], "var1": ['a', 'b', 'a', 'b', 'b', 'a', 'c'],
                             "val": [9, 2, 4, 7, 6, 3, 2],
                             "dates": [pd.Timestamp(2020, 1, 1),
                                       pd.Timestamp(2020, 1, 1),
                                       pd.Timestamp(2020, 1, 2),
                                       pd.Timestamp(2020, 1, 2),
                                       pd.Timestamp(2020, 1, 3),
                                       pd.Timestamp(2020, 1, 5),
                                       pd.Timestamp(2020, 3, 12)]})

I want to group this data by id and var1 wherever the dates are consecutive; if a day is missed, I want to start a new record.

For example, the end result should be:

df_end_result = pd.DataFrame({"id": [102, 102, 103, 103, 104], "var1": ['a', 'b', 'b', 'a', 'c'],
                              "val": [13, 2, 13, 3, 2],
                              "start_date": [pd.Timestamp(2020, 1, 1),
                                             pd.Timestamp(2020, 1, 1),
                                             pd.Timestamp(2020, 1, 2),
                                             pd.Timestamp(2020, 1, 5),
                                             pd.Timestamp(2020, 3, 12)],
                              "end_date": [pd.Timestamp(2020, 1, 2),
                                           pd.Timestamp(2020, 1, 1),
                                           pd.Timestamp(2020, 1, 3),
                                           pd.Timestamp(2020, 1, 5),
                                           pd.Timestamp(2020, 3, 12)]})

I have tried this a few ways and keep failing. The length of time that something can exist for is unknown, and the number of possible var1 values can change with each id and with each date window.

For example, I tried to identify consecutive days like this, but it always returns ['count_days'] == 0 (clearly something is wrong!). I then thought I could take date(min) and date(min) + count_days to get 'start_date' and 'end_date'.

s = df_raw_dates.groupby(['id','var1']).dates.diff().eq(pd.Timedelta(days=1))
s1 = s | s.shift(-1, fill_value=False)
df['count_days'] = np.where(s1, s1.groupby(df.id).cumsum(), 0)

I have also tried:

df = df_raw_dates.groupby(['id', 'var1']).agg({'val': 'sum', 'dates': ['first', 'last']}).reset_index()

This gets me closer, but I don't think it deals with the consecutive-days problem. Instead it returns the earliest and latest day per group, which unfortunately isn't something I can take forward.

Adding more context

Another approach was:

df = df_raw_dates.groupby(['id', 'dates']).size().reset_index().rename(columns={0: 'del'}).drop('del', axis=1)

This provides a list of ids and dates, but I am getting stuck on finding the min and max of each run of consecutive dates within this new window.

Extended example that has a break in the date range for group (102, 'a'):

df_raw_dates = pd.DataFrame(
    {
        "id": [102, 102, 102, 103, 103, 103, 104, 102, 102, 102, 102, 108, 108],
        "var1": ["a", "b", "a", "b", "b", "a", "c", "a", "a", "a", "a", "a", "a"],
        "val": [9, 2, 4, 7, 6, 3, 2, 1, 2, 3, 4, 99, 99],
        "dates": [
            pd.Timestamp(2020, 1, 1),
            pd.Timestamp(2020, 1, 1),
            pd.Timestamp(2020, 1, 2),
            pd.Timestamp(2020, 1, 2),
            pd.Timestamp(2020, 1, 3),
            pd.Timestamp(2020, 1, 5),
            pd.Timestamp(2020, 3, 12),
            pd.Timestamp(2020, 1, 3),
            pd.Timestamp(2020, 1, 7),
            pd.Timestamp(2020, 1, 8),
            pd.Timestamp(2020, 1, 9),
            pd.Timestamp(2020, 1, 21),
            pd.Timestamp(2020, 1, 25),
        ],
    }
)


Further example

This uses the answer below from wwii:

import pandas as pd
import collections

df_raw_dates1 = pd.DataFrame(
    {
        "id": [100,105,105,105,100,105,100,100,105,105,105,105,105,105,105,105,105,105,105,105,105,105,105],
        "var1": ["a","b","d","a","d","c","b","b","b","a","c","d","c","a","d","b","a","d","b","b","d","c","a"],
        "val": [0, 2, 0, 0, 0, 0, 0, 0, 9, 1, 0, 1, 1, 0, 9, 5, 10, 12, 13, 15, 0, 1, 2 ],
        "dates": [
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18)

        ],
    }
)

day = pd.Timedelta('1d')
# using the further example (df_raw_dates1) defined above
gb = df_raw_dates1.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)

for k,g in gb:
    # print(g)
    eyed, var1 = k
    dt = g['dates']
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    # print(k,groups,groups.any())
    # accommodate groups with only one date
    if not groups.any():
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(g.val.sum())
        new_df['start'].append(g.dates.min())
        new_df['end'].append(g.dates.max())
        continue

    for _,date_range in date_groups:
        start,end = date_range['dates'].min(), date_range['dates'].max()
        val = date_range.val.sum()
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(val)
        new_df['start'].append(start)
        new_df['end'].append(end)

print(pd.DataFrame(new_df))

>>>
     id var1   val      start        end
0   100    a   0.0 2021-01-22 2021-01-22
1   100    b   0.0 2021-01-22 2021-01-22
2   100    d   0.0 2021-01-22 2021-01-22

3   105    a   0.0 2021-01-22 2021-01-22
4   105    a   1.0 2021-01-21 2021-01-21
5   105    a   0.0 2021-01-20 2021-01-20
6   105    a  10.0 2021-01-19 2021-01-19

7   105    b   2.0 2021-01-22 2021-01-22
8   105    b   9.0 2021-01-21 2021-01-21
9   105    b   5.0 2021-01-20 2021-01-20
10  105    b  13.0 2021-01-19 2021-01-19

From the above I would have expected rows 3, 4, 5, 6 to be grouped together, and rows 7, 8, 9, 10 as well. I am not sure why this example now breaks.

I am not sure what the difference is between this example and the extended example above, or why this one does not seem to work.
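One likely explanation, offered as a sketch rather than a confirmed diagnosis: within each group the dates in this example are in descending order, whereas the earlier examples are ascending. `diff()` then yields -1 day instead of +1 day, so the `!= day` test flags every row as a break. The snippet below uses only the (105, 'a') dates from the example; the variable names are illustrative.

```python
import pandas as pd

day = pd.Timedelta("1d")
# The (105, 'a') dates from the example above, in their original descending order.
dt = pd.Series(pd.to_datetime(
    ["2021-01-22", "2021-01-21", "2021-01-20", "2021-01-19", "2021-01-18"]
))

# Descending input: every step is -1 day, so every row is flagged as a break.
breaks_desc = dt.diff() != day
# After sorting ascending, only the first row (compared against NaT) is flagged.
breaks_asc = dt.sort_values().diff() != day

print(int(breaks_desc.sum()), int(breaks_asc.sum()))
```

If this is the cause, sorting the frame by 'dates' before the groupby should restore the expected grouping.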

Recommended answer

I don't have Pandas superpowers, so I never try to do groupby one-liners. Maybe someday.

This adapts the accepted answer to the SO question "Find group of consecutive dates in Pandas DataFrame": first group by ['id', 'var1']; then, within each group, group by consecutive date ranges.
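As a minimal illustration of the second step, here is the break-and-cumsum idea applied to a single Series of dates. The toy dates are hypothetical and assumed to be sorted ascending with no duplicates.

```python
import pandas as pd

# Hypothetical toy Series, sorted ascending; not taken from the question's data.
dates = pd.Series(pd.to_datetime(
    ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-07", "2020-01-08"]
))

day = pd.Timedelta("1d")
# True wherever a row does NOT follow the previous row by exactly one day.
# The first row is compared against NaT, so it always starts a new run.
breaks = dates.diff() != day
# The running total of the breaks gives each consecutive run its own label.
run_id = breaks.cumsum()

# One row per run, with its first and last date.
runs = dates.groupby(run_id).agg(["min", "max"])
print(runs)
```

Each distinct value of `run_id` marks one run of consecutive days, which is what the inner groupby in the code below relies on.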

import pandas as pd
sep = "************************************\n"
day = pd.Timedelta('1d')
# using the extended example in the question.
gb = df_raw_dates.groupby(['id', 'var1'])

for k,g in gb:
    print(g)
    dt = g['dates']
    # find difference in days between rows
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)

    # create a Series to identify consecutive ranges to group by
    # this cumsum trick can be found in many SO answers
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    # split into date ranges
    date_groups = g.groupby(groups)
    for _,date_range in date_groups:
        print(date_range)
    print(sep)


You can see that the (102, 'a') group has been split into two groups.

    id var1  val      dates
0  102    a    9 2020-01-01
2  102    a    4 2020-01-02
7  102    a    1 2020-01-03
     id var1  val      dates
8   102    a    2 2020-01-07
9   102    a    3 2020-01-08
10  102    a    4 2020-01-09


Going a bit further: while iterating, build up a dictionary from which to construct a new DataFrame.

import pandas as pd
import collections
day = pd.Timedelta('1d')
# again using the extended example in the question
gb = df_raw_dates.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)
for k,g in gb:
    # print(g)
    eyed,var = k
    dt = g['dates']
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    # print(k,groups,groups.any())
    # accommodate groups with only one date
    if not groups.any():
        new_df['id'].append(eyed)
        new_df['var1'].append(var)
        new_df['val'].append(g.val.mean())
        new_df['start'].append(g.dates.min())
        new_df['end'].append(g.dates.max())
        continue

    for _,date_range in date_groups:
        start,end = date_range['dates'].min(),date_range['dates'].max()
        val = date_range.val.mean()
        new_df['id'].append(eyed)
        new_df['var1'].append(var)
        new_df['val'].append(val)
        new_df['start'].append(start)
        new_df['end'].append(end)


print(pd.DataFrame(new_df))

>>>
    id var1        val      start        end
0  102    a   4.666667 2020-01-01 2020-01-03
1  102    a   3.000000 2020-01-07 2020-01-09
2  102    b   2.000000 2020-01-01 2020-01-01
3  103    a   3.000000 2020-01-05 2020-01-05
4  103    b   6.500000 2020-01-02 2020-01-03
5  104    c   2.000000 2020-03-12 2020-03-12
6  108    a  99.000000 2020-01-21 2020-01-25


This seems pretty tedious; maybe someone will come along with a less verbose solution. Some of the operations could perhaps be moved into functions and used with .apply, .transform, or .pipe to make it a little cleaner.
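One possible shape for a less verbose version, sketched under a few assumptions: there is at most one row per (id, var1, date), 'val' is summed as in the question's desired output (the loop above uses the mean), and the helper name `collapse_consecutive` and its parameters are made up for illustration.

```python
import pandas as pd

def collapse_consecutive(df, keys=("id", "var1"), date_col="dates", val_col="val"):
    """Hypothetical helper: one output row per run of consecutive days."""
    keys = list(keys)
    # Sort first so that diff() sees ascending dates inside every group.
    out = df.sort_values(keys + [date_col]).copy()
    # Gap to the previous row of the same group; NaT on each group's first row.
    gap = out.groupby(keys)[date_col].diff()
    # A new run starts at a group's first row or after any gap other than one day.
    out["run"] = (gap != pd.Timedelta("1d")).cumsum()
    return (
        out.groupby(keys + ["run"], as_index=False)
        .agg(val=(val_col, "sum"), start_date=(date_col, "min"), end_date=(date_col, "max"))
        .drop(columns="run")
    )

# The first example from the question.
df_raw_dates = pd.DataFrame({
    "id": [102, 102, 102, 103, 103, 103, 104],
    "var1": ["a", "b", "a", "b", "b", "a", "c"],
    "val": [9, 2, 4, 7, 6, 3, 2],
    "dates": pd.to_datetime([
        "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02",
        "2020-01-03", "2020-01-05", "2020-03-12",
    ]),
})

result = collapse_consecutive(df_raw_dates)
print(result)
```

Because each row is compared only with the previous row of its own group, a group whose dates are never consecutive simply ends up as several single-day rows, with no special case needed.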

It does not account for ('id', 'var1') groups that have more than one date but no consecutive dates, i.e. only single-day ranges. For example:

     id var1  val      dates
11  108    a   99 2020-01-21
12  108    a   99 2020-01-25

You might need to detect whether there are any gaps in a datetime Series and use that fact to accommodate such groups.
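A gap check along those lines might look like the following. The helper name `has_gaps` is hypothetical, and the sketch assumes daily data where any step other than exactly one day counts as a gap.

```python
import pandas as pd

def has_gaps(dates, step=pd.Timedelta("1d")):
    """Hypothetical helper: True if the sorted unique dates are not all one step apart."""
    ordered = dates.drop_duplicates().sort_values()
    # diff() leaves NaT in the first position; drop it before comparing.
    return bool((ordered.diff().dropna() != step).any())

# The (108, 'a') dates from above, and the second (102, 'a') range.
gapped = pd.Series(pd.to_datetime(["2020-01-21", "2020-01-25"]))
contiguous = pd.Series(pd.to_datetime(["2020-01-07", "2020-01-08", "2020-01-09"]))

print(has_gaps(gapped), has_gaps(contiguous))
```

A group for which this returns True and for which `in_block` is all False would need each date emitted as its own row rather than one min-to-max range.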
