How can I do a sequential count based on column value and timestamp in pandas?
I would like to be able to add a column which counts rows in order based on a value. For example, below are three different people with records that have a timestamp. I want to count the order of records based on the PersonID. This should restart for every PersonID. (I am able to do this in Tableau with Index() but I want it part of the raw file too)
PersonID  DateTime             Order  Total
a226      2015-04-16 11:57:36  1      1
a226      2015-04-17 15:32:14  2      1
a226      2015-04-17 19:13:43  3      1
z342      2015-04-15 07:02:20  1      1
x391      2015-04-17 13:43:31  1      1
x391      2015-04-17 05:12:16  2      1
Is there also a way to subtract the DateTime values of consecutive records? My approach would be to select only the Order 1 rows as one dataframe, select only the Order 2 rows as another, merge them, and then subtract. Is there a way to do this automatically?
IIUC, you can do a groupby with cumcount:
>>> df["Order"] = df.groupby("PersonID").cumcount() + 1
>>> df
PersonID DateTime Order
0 a226 2015-04-16 11:57:36 1
1 a226 2015-04-17 15:32:14 2
2 a226 2015-04-17 19:13:43 3
3 z342 2015-04-15 07:02:20 1
4 x391 2015-04-17 13:43:31 1
5 x391 2015-04-17 05:12:16 2
If you want to guarantee that it's in increasing time order, you should sort by DateTime first, but your example has x391 in non-increasing order, so I'm assuming you don't want that.
If you do want to involve the timestamps, I tend to sort first, to make life easier:
>>> df["DateTime"] = pd.to_datetime(df["DateTime"]) # just in case
>>> df = df.sort_values(["PersonID", "DateTime"])  # DataFrame.sort was removed from pandas; sort_values replaces it
>>> df["Order"] = df.groupby("PersonID").cumcount() + 1
>>> df
PersonID DateTime Order
0 a226 2015-04-16 11:57:36 1
1 a226 2015-04-17 15:32:14 2
2 a226 2015-04-17 19:13:43 3
5 x391 2015-04-17 05:12:16 1
4 x391 2015-04-17 13:43:31 2
3 z342 2015-04-15 07:02:20 1
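Once the rows are sorted within each PersonID, the subtraction asked about in the question can probably be done without any merging. The following is only a minimal sketch, assuming the same six rows as in the example. It uses sort_values for the sort step and groupby(...).diff() for the subtraction. The column name Elapsed is a hypothetical choice, not something from the question.

```python
import pandas as pd

# Rebuild the example data from the question (Order/Total columns omitted)
df = pd.DataFrame({
    "PersonID": ["a226", "a226", "a226", "z342", "x391", "x391"],
    "DateTime": [
        "2015-04-16 11:57:36", "2015-04-17 15:32:14", "2015-04-17 19:13:43",
        "2015-04-15 07:02:20", "2015-04-17 13:43:31", "2015-04-17 05:12:16",
    ],
})
df["DateTime"] = pd.to_datetime(df["DateTime"])

# Sort so that each person's records are in increasing time order
df = df.sort_values(["PersonID", "DateTime"])
df["Order"] = df.groupby("PersonID").cumcount() + 1

# "Elapsed" (hypothetical name): time since the same person's previous record.
# The first record of each person has no predecessor, so it gets NaT.
df["Elapsed"] = df.groupby("PersonID")["DateTime"].diff()
```

Each person's first record ends up with NaT in Elapsed, and every later record holds a Timedelta relative to the record before it, so the select/merge/subtract steps should not be needed.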
Even without sorting, though, you could call rank on the grouped column, which has more options to specify how you want to handle ties:
>>> df["Order"] = df.groupby("PersonID")["DateTime"].rank()
>>> df
PersonID DateTime Order
0 a226 2015-04-16 11:57:36 1.0
1 a226 2015-04-17 15:32:14 2.0
2 a226 2015-04-17 19:13:43 3.0
5 x391 2015-04-17 05:12:16 1.0
4 x391 2015-04-17 13:43:31 2.0
3 z342 2015-04-15 07:02:20 1.0
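To illustrate the tie handling mentioned above, here is a small sketch using the method parameter of rank. The data is hypothetical (it is not from the question): two records for a226 deliberately share a timestamp so that the methods give different results. Note that rank returns floats, so a cast may be wanted if integer counts are needed.

```python
import pandas as pd

# Hypothetical data: the first two a226 records have identical timestamps
tied = pd.DataFrame({
    "PersonID": ["a226", "a226", "a226", "x391"],
    "DateTime": pd.to_datetime([
        "2015-04-16 11:57:36", "2015-04-16 11:57:36",
        "2015-04-17 19:13:43", "2015-04-17 05:12:16",
    ]),
})

grouped = tied.groupby("PersonID")["DateTime"]

tied["average"] = grouped.rank()                # default: tied rows share the mean rank
tied["first"] = grouped.rank(method="first")    # ties broken by order of appearance
tied["dense"] = grouped.rank(method="dense")    # ties share a rank, no gaps afterwards
tied["min"] = grouped.rank(method="min")        # ties share the lowest rank, gap afterwards
```

With method="first" the result matches what cumcount() + 1 would give on time-sorted data, which is likely the closest fit to the Order column in the question.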