Grouping by date range with pandas
I am looking to group by two columns: user_id and date. However, if two dates are close enough, I want to treat the entries as part of the same group and aggregate them accordingly. Dates are in m-d-y format.
user_id date val
1 1-1-17 1
2 1-1-17 1
3 1-1-17 1
1 1-1-17 1
1 1-2-17 1
2 1-2-17 1
2 1-10-17 1
3 2-1-17 1
The grouping would be by user_id and by dates within +/- 3 days of each other, so the result of summing val per group would look like:
user_id date sum(val)
1 1-2-17 3
2 1-2-17 2
2 1-10-17 1
3 1-1-17 1
3 2-1-17 1
Can anyone think of a way to do this (somewhat) easily? I know there are some problematic aspects, for example what to do if the dates chain together endlessly, each three days apart. But the exact data I'm using only has 2 values per person.
Thanks!
I'd convert this to a datetime column and then use pd.TimeGrouper:
dates = pd.to_datetime(df.date, format='%m-%d-%y')
print(dates)
0 2017-01-01
1 2017-01-01
2 2017-01-01
3 2017-01-01
4 2017-01-02
5 2017-01-02
6 2017-01-10
7 2017-02-01
Name: date, dtype: datetime64[ns]
df = (df.assign(date=dates).set_index('date')
        .groupby(['user_id', pd.TimeGrouper('3D')])
        .sum()
        .reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
A similar solution using pd.Grouper:
df = (df.assign(date=dates)
        .groupby(['user_id', pd.Grouper(key='date', freq='3D')])
        .sum()
        .reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
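For anyone who wants to run the pd.Grouper version end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the table in the question (that construction is my assumption about how the data is stored), and the name `result` is only illustrative.

```python
import pandas as pd

# Sample data reconstructed from the question's table (assumed layout).
df = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 1, 2, 2, 3],
    "date": ["1-1-17", "1-1-17", "1-1-17", "1-1-17",
             "1-2-17", "1-2-17", "1-10-17", "2-1-17"],
    "val": [1] * 8,
})

# Parse the m-d-y strings into real datetimes.
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%y")

# Group by user and by fixed 3-day bins on the date column, then sum val.
result = (
    df.groupby(["user_id", pd.Grouper(key="date", freq="3D")])["val"]
      .sum()
      .reset_index()
)
print(result)
```

The date shown for each group is the start of its 3-day bin, not one of the original dates, which is why user 3's second row reads 2017-01-31 rather than 2017-02-01.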
Update: TimeGrouper was deprecated in pandas 0.21 and removed in pandas 1.0, so Grouper is the one to use in this scenario (thanks for the heads up, Vaishali!).
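One caveat: a '3D' frequency cuts the timeline into fixed 3-day windows, so two dates only a day apart can still land in different bins if they straddle a bin edge. If what you want is literally "start a new group whenever the gap to the previous date for that user exceeds 3 days", a gap-based approach may fit better. The following is a rough sketch under that assumption, not part of the solution above; the column names `cluster`, `first_date` and `last_date` are made up for illustration, and chained dates (each within 3 days of the next) will all merge into one group, which is the issue the question already mentions.

```python
import pandas as pd

# Sample data reconstructed from the question's table (assumed layout).
df = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 1, 2, 2, 3],
    "date": ["1-1-17", "1-1-17", "1-1-17", "1-1-17",
             "1-2-17", "1-2-17", "1-10-17", "2-1-17"],
    "val": [1] * 8,
})
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%y")

# Sort so that each user's rows are in chronological order.
df = df.sort_values(["user_id", "date"])

# Gap to the previous row of the same user (NaT for each user's first row).
gap = df.groupby("user_id")["date"].diff()

# A new cluster starts at a user's first row or after a gap of more than 3 days.
new_cluster = gap.isna() | (gap > pd.Timedelta(days=3))
df["cluster"] = new_cluster.cumsum()

# Aggregate each (user, cluster) pair.
result = (
    df.groupby(["user_id", "cluster"])
      .agg(first_date=("date", "min"),
           last_date=("date", "max"),
           val=("val", "sum"))
      .reset_index()
      .drop(columns="cluster")
)
print(result)
```

On the sample data, `last_date` lines up with the dates in the question's desired output (1-2-17, 1-2-17, 1-10-17, 1-1-17, 2-1-17), whereas the fixed-bin version reports bin start dates.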