Cumulative count of unique values in pandas
Question
I would like to cumulatively count unique values from a column in a pandas frame by week. For example, imagine that I have data like this:
import pandas as pd
df = pd.DataFrame({'user_id':[1,1,1,2,2,2],'week':[1,1,2,1,2,2],'module_id':['A','B','A','A','B','C']})
+---+---------+------+-----------+
| | user_id | week | module_id |
+---+---------+------+-----------+
| 0 | 1 | 1 | A |
| 1 | 1 | 1 | B |
| 2 | 1 | 2 | A |
| 3 | 2 | 1 | A |
| 4 | 2 | 2 | B |
| 5 | 2 | 2 | C |
+---+---------+------+-----------+
What I want is a running count of the number of unique module_ids up to each week, i.e. something like this:
+---+---------+------+-------------------------+
| | user_id | week | cumulative_module_count |
+---+---------+------+-------------------------+
| 0 | 1 | 1 | 2 |
| 1 | 1 | 2 | 2 |
| 2 | 2 | 1 | 1 |
| 3 | 2 | 2 | 3 |
+---+---------+------+-------------------------+
It is straightforward to do this with a loop; for example, this works:
running_tally = {}  # user_id -> set of module_ids seen so far
result = {}         # user_id -> {week: number of unique modules up to that week}
for index, row in df.iterrows():
    if row['user_id'] not in running_tally:
        running_tally[row['user_id']] = set()
        result[row['user_id']] = {}
    running_tally[row['user_id']].add(row['module_id'])
    result[row['user_id']][row['week']] = len(running_tally[row['user_id']])
print(result)
{1: {1: 2, 2: 2}, 2: {1: 1, 2: 3}}
But my real data frame is enormous and so I would like a vectorised algorithm instead of a loop.
There's a similar sounding question here, but looking at the accepted answer (here) the original poster does not want uniqueness across dates cumulatively, as I do.
How would I do this vectorised in pandas?
Recommended answer
The idea is to create lists per group using both columns, then use np.cumsum to build cumulative lists within each user, and finally convert the values to sets and take their length:
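import numpy as np

df1 = (df.groupby(['user_id','week'])['module_id']
         .apply(list)                    # modules per (user_id, week)
         .groupby(level=0)
         .apply(np.cumsum)               # concatenate the lists cumulatively per user
         .apply(lambda x: len(set(x)))   # count distinct modules seen so far
         .reset_index(name='cumulative_module_count'))
print(df1)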
   user_id  week  cumulative_module_count
0        1     1                        2
1        1     2                        2
2        2     1                        1
3        2     2                        3
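The answer above builds a growing Python list for every row, which can be slow on a very large frame, and cumulative sums over object-dtype lists may not be supported in every pandas version. As an alternative, here is a minimal sketch (not part of the original answer) that avoids lists: it flags the first week in which each user meets each module, counts those first sightings per week, and takes a running total per user. The names `ordered`, `is_new`, `new_per_week` and `df2` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2],
    'week': [1, 1, 2, 1, 2, 2],
    'module_id': ['A', 'B', 'A', 'A', 'B', 'C'],
})

# Sort by week so that "first occurrence" means "earliest week".
ordered = df.sort_values('week', kind='stable')

# True where a (user_id, module_id) pair appears for the first time.
is_new = ~ordered.duplicated(['user_id', 'module_id'])

# Number of newly seen modules per (user_id, week); weeks that add
# nothing new are kept with a count of 0.
new_per_week = is_new.groupby([ordered['user_id'], ordered['week']]).sum()

# Running total of new modules within each user.
df2 = (new_per_week.groupby(level='user_id')
                   .cumsum()
                   .reset_index(name='cumulative_module_count'))
print(df2)
```

This sketch assumes that `week` values sort in chronological order. Only weeks that appear in the data for a given user show up in the result, which matches the behaviour of the loop in the question.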