How to assign a unique ID for different groups in pandas dataframe?
Question
How can I assign unique IDs to groups created in a pandas dataframe based on certain conditions? For example, I have a dataframe named df with the following structure: Name identifies the user, and Datetime identifies the date/time at which the user accessed a resource.
Name Datetime
Bob 26-04-2018 12:00:00
Claire 26-04-2018 12:00:00
Bob 26-04-2018 12:10:00
Bob 26-04-2018 12:30:00
Grace 27-04-2018 08:30:00
Bob 27-04-2018 09:30:00
Bob 27-04-2018 09:40:00
Bob 27-04-2018 10:00:00
Bob 27-04-2018 10:30:00
Bob 27-04-2018 11:30:00
I would like to create sessions for the users: consecutive accesses by the same Name that are no more than 30 minutes apart should share one unique session ID. If a user is inactive for more than 30 minutes, that user's next access should start a new session.
My expected output is shown below.
For example, on 27-04-2018 Bob accessed the resource at 09:30, 09:40, 10:00 and 10:30. Each gap is at most 30 minutes, so all four accesses belong to session 4. Bob's next access is at 11:30, more than 30 minutes after the previous one, so it is assigned the next session (5).
Name Datetime Id
Bob 26-04-2018 12:00:00 1
Claire 26-04-2018 12:00:00 2
Bob 26-04-2018 12:10:00 1
Bob 26-04-2018 12:30:00 1
Grace 27-04-2018 08:30:00 3
Bob 27-04-2018 09:30:00 4
Bob 27-04-2018 09:40:00 4
Bob 27-04-2018 10:00:00 4
Bob 27-04-2018 10:30:00 4
Bob 27-04-2018 11:30:00 5
Thank you for your help! Link to previous question: How to compare value of second column with same values of first column in pandas dataframe?
Recommended answer
Your explanation near the bottom is really helpful for understanding the problem.
You need to group by Name and a groupID (don't confuse this groupID with your final Id) and call ngroup to return Id. The main thing is how to define this groupID. To create it, use sort_values to sort the rows by Name and Datetime in ascending order, group by Name, and take the difference in Datetime between consecutive rows within each Name. Use gt to flag differences greater than 30 minutes and cumsum to turn those flags into a groupID. Finally, use sort_index to restore the original row order and assign the result to s, as follows:
# Assumes pandas is imported as pd and df['Datetime'] is already a datetime column
s = df.sort_values(['Name','Datetime']).groupby('Name').Datetime.diff() \
      .gt(pd.Timedelta(minutes=30)).cumsum().sort_index()
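To see what each step of that chain contributes, here is a minimal sketch that rebuilds the sample data from the question and keeps the intermediate results. The variable names ordered, gap and new_session are illustrative and not part of the original answer; the day-first format string is an assumption based on how the dates are written in the question.

```python
import pandas as pd

# Sample data from the question; dates are written day-first (DD-MM-YYYY).
df = pd.DataFrame({
    "Name": ["Bob", "Claire", "Bob", "Bob", "Grace",
             "Bob", "Bob", "Bob", "Bob", "Bob"],
    "Datetime": pd.to_datetime([
        "26-04-2018 12:00:00", "26-04-2018 12:00:00", "26-04-2018 12:10:00",
        "26-04-2018 12:30:00", "27-04-2018 08:30:00", "27-04-2018 09:30:00",
        "27-04-2018 09:40:00", "27-04-2018 10:00:00", "27-04-2018 10:30:00",
        "27-04-2018 11:30:00",
    ], format="%d-%m-%Y %H:%M:%S"),
})

# Step 1: put each user's rows in chronological order.
ordered = df.sort_values(["Name", "Datetime"])

# Step 2: time since the same user's previous access (NaT for a user's first row).
gap = ordered.groupby("Name")["Datetime"].diff()

# Step 3: True where the gap is strictly greater than 30 minutes (NaT compares False).
new_session = gap.gt(pd.Timedelta(minutes=30))

# Step 4: running count of such breaks, restored to the original row order.
s = new_session.cumsum().sort_index()

print(pd.DataFrame({"gap": gap.sort_index(),
                    "new_session": new_session.sort_index(),
                    "s": s}))
```

Note that the cumsum runs over the whole sorted frame rather than per user, so s on its own is not a session ID; it only becomes one when combined with Name in the grouping step. A gap of exactly 30 minutes (Bob at 10:00 and 10:30) does not start a new session because gt is a strict comparison.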
Next, group by Name and s with sort=False to preserve the original order, then call ngroup and add 1.
df['Id'] = df.groupby(['Name', s], sort=False).ngroup().add(1)
Out[834]:
Name Datetime Id
0 Bob 2018-04-26 12:00:00 1
1 Claire 2018-04-26 12:00:00 2
2 Bob 2018-04-26 12:10:00 1
3 Bob 2018-04-26 12:30:00 1
4 Grace 2018-04-27 08:30:00 3
5 Bob 2018-04-27 09:30:00 4
6 Bob 2018-04-27 09:40:00 4
7 Bob 2018-04-27 10:00:00 4
8 Bob 2018-04-27 10:30:00 4
9 Bob 2018-04-27 11:30:00 5
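For reuse, the two steps can be wrapped into one function. The following is only a sketch: assign_session_ids and its timeout parameter are hypothetical names introduced here, not part of the original answer, and the code assumes the frame has a unique index (sort_index is what restores the row order) and that the date strings are day-first.

```python
import pandas as pd

def assign_session_ids(frame, timeout=pd.Timedelta(minutes=30)):
    """Hypothetical helper: return 1-based session IDs aligned to frame's index."""
    ordered = frame.sort_values(["Name", "Datetime"])
    # True where a user's gap since their previous access exceeds the timeout.
    breaks = ordered.groupby("Name")["Datetime"].diff().gt(timeout)
    # Running count of breaks, back in the original row order; renamed so the
    # key is not mistaken for the Datetime column.
    s = breaks.cumsum().sort_index().rename("break_count")
    # sort=False numbers the (Name, break_count) groups in order of first appearance.
    return frame.groupby(["Name", s], sort=False).ngroup().add(1)

raw = pd.DataFrame({
    "Name": ["Bob", "Claire", "Bob", "Bob", "Grace",
             "Bob", "Bob", "Bob", "Bob", "Bob"],
    "Datetime": [
        "26-04-2018 12:00:00", "26-04-2018 12:00:00", "26-04-2018 12:10:00",
        "26-04-2018 12:30:00", "27-04-2018 08:30:00", "27-04-2018 09:30:00",
        "27-04-2018 09:40:00", "27-04-2018 10:00:00", "27-04-2018 10:30:00",
        "27-04-2018 11:30:00",
    ],
})
# Parse the day-first strings before computing time differences.
raw["Datetime"] = pd.to_datetime(raw["Datetime"], format="%d-%m-%Y %H:%M:%S")
raw["Id"] = assign_session_ids(raw)
print(raw)
```

Passing a different timeout changes where sessions split; with pd.Timedelta(minutes=60), for instance, Bob's 11:30 access would stay in the same session as his 10:30 access.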