Calculate nunique() for groupby in pandas


Problem description

I have a dataframe with the following columns:

  1. diff - difference between registration date and payment date, in days
  2. country - country of the user
  3. user_id
  4. campaign_id - another categorical column; we will use it in the groupby

I need to count the distinct users with diff <= n for every country + campaign_id group. For example, for country 'A', campaign 'abc' and diff 7, I need the number of distinct users from country 'A' and campaign 'abc' whose diff is <= 7.
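To make the requirement concrete, here is a minimal sketch on a made-up four-row frame. The data and the helper name `unique_users_up_to` are hypothetical and used for illustration only; the column is called `campaign`, as in the test data in this question.

```python
import pandas as pd

# Hypothetical toy data: one country, one campaign, three distinct users.
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A'],
    'campaign': ['abc', 'abc', 'abc', 'abc'],
    'diff': [1, 3, 7, 9],
    'user_id': [10, 10, 11, 12],
})


def unique_users_up_to(frame, country, campaign, n):
    """Count distinct users in one country/campaign group with diff <= n."""
    mask = (
        (frame['country'] == country)
        & (frame['campaign'] == campaign)
        & (frame['diff'] <= n)
    )
    return frame.loc[mask, 'user_id'].nunique()


# Rows with diff <= 7 belong to users 10, 10 and 11, so two distinct users.
count_7 = unique_users_up_to(toy, 'A', 'abc', 7)
print(count_7)
```

The desired result is this count for every combination of country, campaign and threshold n.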

My current solution (below) takes too long to run:

import pandas as pd
import numpy as np

# generate a test dataframe
df = pd.DataFrame({
    'country': np.random.choice(['A', 'B', 'C', 'D'], 10000),
    'campaign': np.random.choice(['camp1', 'camp2', 'camp3', 'camp4', 'camp5', 'camp6'], 10000),
    'diff': np.random.choice(range(10), 10000),
    'user_id': np.random.choice(range(1000), 10000)
})

# main: for every threshold, filter the rows and count distinct users per group
result_df = pd.DataFrame()
for diff in df['diff'].unique():
    tmp_df = df.loc[df['diff'] <= diff, :]
    tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).apply(lambda x: x.user_id.nunique()).reset_index()
    tmp_df['diff'] = diff
    tmp_df.columns = ['country', 'campaign', 'unique_ppl', 'diff']
    result_df = pd.concat([result_df, tmp_df], ignore_index=True, axis=0)

Maybe there is a better way to do this?

Recommended answer

First use a list comprehension with concat and assign to stack the filtered subsets together, overwriting the diff column with the threshold each subset was filtered by. Then use groupby with nunique, including the new diff column in the grouping keys. Finally, rename the result column and, if necessary, add reindex for a custom column order:

df1 = pd.concat([df.loc[df['diff']<=x].assign(diff=x) for x in  df['diff'].unique()])
df2 = (df1.groupby(['diff','country', 'campaign'], sort=False)['user_id']
          .nunique()
          .reset_index()
          .rename(columns={'user_id':'unique_ppl'})
          .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))
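
The concat approach above copies the filtered frame once per distinct diff value. As a sketch of an alternative that is not part of the original answer: a user counts toward threshold n exactly when that user's smallest diff within the group is <= n, so it is enough to find each user's first diff, count new users per diff value, and take a running sum. The names `first_diff`, `new_users`, `cumulative` and `alt` are illustrative, and the seeded random data is an assumption made for reproducibility.

```python
import numpy as np
import pandas as pd

# Seeded test data so the comparison below is reproducible (assumption).
rng = np.random.default_rng(0)
n_rows = 10_000
df = pd.DataFrame({
    'country': rng.choice(['A', 'B', 'C', 'D'], n_rows),
    'campaign': rng.choice(['camp1', 'camp2', 'camp3'], n_rows),
    'diff': rng.integers(0, 10, n_rows),
    'user_id': rng.integers(0, 1000, n_rows),
})

# Reference result: the concat + groupby + nunique approach from the answer.
df1 = pd.concat([df.loc[df['diff'] <= x].assign(diff=x) for x in df['diff'].unique()])
expected = (df1.groupby(['diff', 'country', 'campaign'], sort=False)['user_id']
               .nunique()
               .reset_index()
               .rename(columns={'user_id': 'unique_ppl'}))

# Step 1: smallest diff at which each user appears in each group.
first_diff = (df.groupby(['country', 'campaign', 'user_id'])['diff']
                .min()
                .reset_index())

# Step 2: number of users first seen at each diff value, one column per diff.
all_diffs = pd.Index(sorted(df['diff'].unique()), name='diff')
new_users = (first_diff.groupby(['country', 'campaign', 'diff'])
                       .size()
                       .unstack('diff', fill_value=0)
                       .reindex(columns=all_diffs, fill_value=0))

# Step 3: the running sum over diff gives distinct users with diff <= n.
cumulative = new_users.cumsum(axis=1)

# Back to long format; drop empty groups, which the concat approach omits.
alt = cumulative.stack().rename('unique_ppl').reset_index()
alt = alt[alt['unique_ppl'] > 0]
```

After sorting both frames by country, campaign and diff, `alt` should contain the same rows as `expected`. Whether it is actually faster depends on the data, so it is worth timing both on the real frame.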
