Cumulative count of unique values in pandas

Views: 70 | Published: 2020/5/24 3:15:38 | Tag: pandas

Problem description

I would like to cumulatively count unique values from a column in a pandas frame by week. For example, imagine that I have data like this:

import pandas as pd

df = pd.DataFrame({'user_id':[1,1,1,2,2,2],'week':[1,1,2,1,2,2],'module_id':['A','B','A','A','B','C']})

+---+---------+------+-----------+
|   | user_id | week | module_id |
+---+---------+------+-----------+
| 0 |       1 |    1 |         A |
| 1 |       1 |    1 |         B |
| 2 |       1 |    2 |         A |
| 3 |       2 |    1 |         A |
| 4 |       2 |    2 |         B |
| 5 |       2 |    2 |         C |
+---+---------+------+-----------+

What I want is a running count of the number of unique module_ids up to each week, i.e. something like this:

+---+---------+------+-------------------------+
|   | user_id | week | cumulative_module_count |
+---+---------+------+-------------------------+
| 0 |       1 |    1 |                       2 |
| 1 |       1 |    2 |                       2 |
| 2 |       2 |    1 |                       1 |
| 3 |       2 |    2 |                       3 |
+---+---------+------+-------------------------+

It is straightforward to do this as a loop, for example this works:

running_tally = {}
result = {}
for index, row in df.iterrows():
    if row['user_id'] not in running_tally:
        running_tally[row['user_id']] = set()
        result[row['user_id']] = {}
    running_tally[row['user_id']].add(row['module_id'])
    result[row['user_id']][row['week']] = len(running_tally[row['user_id']])
print(result)

{1: {1: 2, 2: 2}, 2: {1: 1, 2: 3}}

But my real data frame is enormous and so I would like a vectorised algorithm instead of a loop.

There is a similar-sounding question elsewhere, but judging by its accepted answer, the original poster there did not want uniqueness counted cumulatively across dates, as I do.

How would I do this vectorised in pandas?

Recommended answer

The idea is to collect the module_id values into a list for each (user_id, week) group, use np.cumsum within each user to concatenate those lists cumulatively, and finally convert each accumulated list to a set and take its length:

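import numpy as np

# df is the frame defined in the question
df1 = (df.groupby(['user_id','week'])['module_id']
         .apply(list)                         # one list of modules per (user_id, week)
         # assumption: group_keys=False avoids a duplicated user_id index level on pandas >= 2.0
         .groupby(level=0, group_keys=False)
         .apply(np.cumsum)                    # concatenate the lists cumulatively within each user
         .apply(lambda x: len(set(x)))        # number of distinct modules seen so far
         .reset_index(name='cumulative_module_count'))

print(df1)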
   user_id  week  cumulative_module_count
0        1     1                        2
1        1     2                        2
2        2     1                        1
3        2     2                        3
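
The list-accumulation approach above still builds Python lists and sets for every group, which may be slow on a very large frame. As a hedged alternative (not part of the original answer), the sketch below flags the first time each user sees a module, counts those first sightings per week, and takes a running sum per user. The names ordered, is_new, new_per_week and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id':   [1, 1, 1, 2, 2, 2],
    'week':      [1, 1, 2, 1, 2, 2],
    'module_id': ['A', 'B', 'A', 'A', 'B', 'C'],
})

# Sort so that the first occurrence of each module falls in its earliest week.
ordered = df.sort_values(['user_id', 'week'])

# True on the row where a user encounters a module for the first time.
is_new = ~ordered.duplicated(['user_id', 'module_id'])

# Number of newly seen modules per (user_id, week); weeks with none sum to 0.
new_per_week = is_new.groupby([ordered['user_id'], ordered['week']]).sum()

# Running total of newly seen modules within each user.
result = (new_per_week.groupby(level='user_id')
                      .cumsum()
                      .reset_index(name='cumulative_module_count'))

print(result)
```

Because every (user_id, week) pair present in the data keeps a row even when no new module appears, a week such as user 1's week 2 carries the previous total forward instead of disappearing.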
