Cumulative Set in PANDAS

Views: 106 | Published: 2020/5/23 23:48:20 | Tags: python, pandas


Question

I have a dataframe of tweets, and I want to group it by date and generate a column containing the cumulative set of all unique users who have posted up to and including that date. None of the existing functions (e.g., cumsum) appears to do this. Here is a sample of the original tweet dataframe, where the index (created_at) is in datetime format:

In [3]: df
Out[3]: 
            screen_name 
created_at  
04-01-16    Bob 
04-01-16    Bob
04-01-16    Sally
04-01-16    Sally
04-02-16    Bob
04-02-16    Miguel
04-02-16    Tim

I can collapse the dataset by date and get a column with the unique users per day:

In [4]: df[['screen_name']].groupby(df.index.date).aggregate(lambda x: set(list(x)))

Out[4]:             screen_name
        2016-04-01  {Bob, Sally}
        2016-04-02  {Bob, Miguel, Tim}

So far so good. But what I'd like is a "cumulative set" like this:

Out[4]:             Cumulative_list_up_to_this_date   Cumulative_number_of_unique_users
        2016-04-01  {Bob, Sally}                      2
        2016-04-02  {Bob, Sally, Miguel, Tim}         4

Ultimately, what I am really interested in is the cumulative number in the last column, so that I can plot it. I've considered looping over the dates, among other things, but can't seem to find a good way. Thanks in advance for any help.

Recommended answer

You cannot add sets with +, but you can add lists! So build a list of users per day, take the cumulative sum (which concatenates the lists), and finally apply the set constructor to get rid of duplicates.

cum_names = (df['screen_name'].groupby(df.index.date)
                              .agg(lambda x: list(x))
                              .cumsum()
                              .apply(set))
# 2016-04-01                 {Bob, Sally}
# 2016-04-02    {Bob, Miguel, Tim, Sally}
# dtype: object
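
If concatenating ever-growing lists is a concern, the same cumulative sets can probably be built with a running set union instead. The following is only a sketch, not part of the original answer: the sample frame is reconstructed from the question (dates assumed to be MM-DD-YY), and the names daily_sets and cum_sets are made up for illustration.

```python
from itertools import accumulate

import pandas as pd

# Sample data mirroring the question (dates assumed to be MM-DD-YY).
df = pd.DataFrame(
    {"screen_name": ["Bob", "Bob", "Sally", "Sally", "Bob", "Miguel", "Tim"]},
    index=pd.to_datetime(
        ["04-01-16"] * 4 + ["04-02-16"] * 3, format="%m-%d-%y"
    ).rename("created_at"),
)

# One set of unique users per calendar day (same idea as In [4] above).
daily_sets = df["screen_name"].groupby(df.index.date).agg(lambda s: set(s))

# Running union: each element is the union of all daily sets seen so far.
# `seen | today` builds a new set, so the daily sets are left unmodified.
cum_sets = pd.Series(
    list(accumulate(daily_sets, lambda seen, today: seen | today)),
    index=daily_sets.index,
)
```

Here itertools.accumulate plays the role that cumsum plays for lists, with set union in place of +.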

cum_count = cum_names.apply(len)
# 2016-04-01    2
# 2016-04-02    4
# dtype: int64
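
Since only the cumulative count is needed for plotting, one alternative is to skip the sets entirely: keep each user's first tweet, count new users per day, and take an ordinary numeric cumsum. This is a sketch under the same assumptions as the question's sample (dates assumed to be MM-DD-YY); first_seen and new_per_day are illustrative names, not part of the original answer.

```python
import pandas as pd

# Sample data mirroring the question (dates assumed to be MM-DD-YY).
df = pd.DataFrame(
    {"screen_name": ["Bob", "Bob", "Sally", "Sally", "Bob", "Miguel", "Tim"]},
    index=pd.to_datetime(
        ["04-01-16"] * 4 + ["04-02-16"] * 3, format="%m-%d-%y"
    ).rename("created_at"),
)

# Sort chronologically, then keep only the first row for each user,
# i.e. the tweet with which that user first appears.
first_seen = df.sort_index().reset_index().drop_duplicates("screen_name")

# Number of first-time users per day, then a running total.
new_per_day = first_seen.groupby(first_seen["created_at"].dt.date).size()
cum_count = new_per_day.cumsum()
```

One caveat: a day on which tweets exist but no new user appears is absent from new_per_day, so for plotting it may be necessary to reindex cum_count over all dates and forward-fill.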
