Count re-occurrence of a value in Python aggregated with respect to another value
Problem description
This question is a follow-up to my previous post.
Now I have data like this:
Sno User Cookie
1 1 A
2 1 A
3 1 A
4 1 B
5 1 C
6 1 D
7 1 A
8 1 B
9 1 D
10 1 E
11 1 D
12 1 A
13 2 F
14 2 G
15 2 F
16 2 G
17 2 H
18 2 H
Let's say user 1 has 5 cookies: A, B, C, D and E. I want to count how often a cookie re-occurs after a different cookie has been seen in between. In the example above, cookie A is encountered again at position 7 and then at position 12. Note that A at position 2 is not counted, because it directly follows another A; but before positions 7 and 12 other cookies had been seen since the last A, so those instances are counted. This is what I get if I run the code mentioned in my previous post:
For user 1:
Sno Cookie Count
1 A 2
2 B 1
3 C 0
4 D 2
5 E 0
For user 2:
Sno Cookie Count
6 F 1
7 G 1
8 H 0
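The code from the previous post is not shown here, so the following is only a sketch of one way the per-cookie counts above could be reproduced with pandas. The names starts_run, runs and per_cookie are illustrative assumptions, and the DataFrame is rebuilt from the sample data in the question.

```python
import pandas as pd

# Sample data from the question (Sno is left out; it plays no role in the count).
df = pd.DataFrame({
    "User": [1] * 12 + [2] * 6,
    "Cookie": list("AAABCDABDEDA") + list("FGFGHH"),
})

# A row starts a new run when its Cookie or its User differs from the row above.
starts_run = (df["Cookie"] != df["Cookie"].shift()) | (df["User"] != df["User"].shift())
runs = df[starts_run]

# A cookie that appears in k separate runs has re-occurred k - 1 times.
per_cookie = (
    runs.groupby(["User", "Cookie"])
    .size()
    .sub(1)
    .rename("Count")
    .reset_index()
)
print(per_cookie)
```

For the sample data this gives a Count of 2 for A and D, 1 for B, F and G, and 0 for C, E and H, matching the two tables above.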
Now comes the tricky part. From the counts we know that three cookies (A, B and D) re-occurred for user 1, and two cookies (F and G) re-occurred for user 2. I want to aggregate these results like this:
Sno User Reoccurred_Instances
1 1 3
2 2 2
Is there an easier way to get this result without using a loop?
Recommended answer
Follow the same first steps as in my answer to your previous question to get rid of consecutive Cookie values and find the duplicates:
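# Drop rows whose Cookie repeats the row directly above; copy so a column can be added safely.
no_doubles = df[df.Cookie != df.Cookie.shift()].copy()
# Flag every later occurrence of a Cookie that has already been seen.
no_doubles['dups'] = no_doubles.Cookie.duplicated()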
Then take the subset of rows that are indeed duplicated (no_doubles[no_doubles['dups']]), group it by User with groupby, and find the number of unique Cookie values for each user using nunique:
no_doubles[no_doubles['dups']].groupby('User')['Cookie'].nunique().reset_index()
This returns:
User Cookie
0 1 3
1 2 2
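To try the steps above end to end, here is a self-contained sketch that rebuilds the question's sample data as a DataFrame named df (the construction is an assumption, since the original data loading is not shown) and runs the same pipeline. The variable name result is illustrative.

```python
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    "Sno": range(1, 19),
    "User": [1] * 12 + [2] * 6,
    "Cookie": list("AAABCDABDEDA") + list("FGFGHH"),
})

# Step 1: drop consecutive repeats of the same Cookie.
no_doubles = df[df.Cookie != df.Cookie.shift()].copy()
# Step 2: flag cookies that have been seen before.
no_doubles["dups"] = no_doubles.Cookie.duplicated()
# Step 3: per user, count the distinct cookies that re-occurred.
result = (
    no_doubles[no_doubles["dups"]]
    .groupby("User")["Cookie"]
    .nunique()
    .reset_index()
)
print(result)
```

Note that this version compares Cookie values only, so it relies on no cookie being shared between users, which holds for this sample.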
You can rename the columns as desired.
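As a small illustration of the renaming step, the sketch below uses a hand-built stand-in for the result shown above (the frame and the names result and renamed are assumptions) and renames the count column to the Reoccurred_Instances header requested in the question.

```python
import pandas as pd

# Hypothetical stand-in for the two-column result shown above.
result = pd.DataFrame({"User": [1, 2], "Cookie": [3, 2]})

# Rename the count column to the header the question asks for.
renamed = result.rename(columns={"Cookie": "Reoccurred_Instances"})
print(renamed)
```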
To deal with different cases, you can just add to this logic. For example, consider the following dataframe, in which user 3 has no repeats:
Sno User Cookie
1 1 A
2 1 A
3 1 A
4 1 B
5 1 C
6 1 D
7 1 A
8 1 B
9 1 D
10 1 E
11 1 D
12 1 A
13 2 F
14 2 G
15 2 F
16 2 G
17 2 H
18 2 H
19 3 H
20 3 I
21 3 J
You can do:
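# Keep a row if its Cookie or its User differs from the row above.
no_doubles = df[(df.Cookie != df.Cookie.shift()) | (df.User != df.User.shift())].copy()
# A repeat now means the same (Cookie, User) pair has been seen before.
no_doubles['dups'] = no_doubles.duplicated(['Cookie', 'User'])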
no_doubles.groupby('User').apply(lambda x: x[x.dups]['Cookie'].nunique()).to_frame('Reoccurred_Instances')
to get:
Reoccurred_Instances
User
1 3
2 2
3 0
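The same table can also be produced without apply. The sketch below is one possible alternative, not the answer's own method: it counts unique re-occurring cookies on the duplicated subset and then reindexes over all users so that users without repeats show 0. The DataFrame construction and the name counts are assumptions for illustration.

```python
import pandas as pd

# Rebuild the three-user sample data (Sno omitted; it is not used).
df = pd.DataFrame({
    "User": [1] * 12 + [2] * 6 + [3] * 3,
    "Cookie": list("AAABCDABDEDA") + list("FGFGHH") + list("HIJ"),
})

# Keep a row if its Cookie or its User differs from the row above.
no_doubles = df[(df.Cookie != df.Cookie.shift()) | (df.User != df.User.shift())].copy()
# Flag (Cookie, User) pairs that have been seen before.
no_doubles["dups"] = no_doubles.duplicated(["Cookie", "User"])

# Count distinct re-occurring cookies per user, then add users with no repeats as 0.
counts = (
    no_doubles[no_doubles["dups"]]
    .groupby("User")["Cookie"]
    .nunique()
    .reindex(df["User"].unique(), fill_value=0)
    .rename("Reoccurred_Instances")
    .rename_axis("User")
)
print(counts.to_frame())
```

The reindex step is what keeps user 3 in the output; filtering to the duplicated rows alone would drop any user who has no repeats.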