Calculate nunique() for groupby in pandas
Question
I have a dataframe with columns:
- diff: difference between registration date and payment date, in days
- country: country of the user
- user_id
- campaign_id: another categorical column, which we will use in the groupby
I need to calculate the number of distinct users for every country + campaign_id group that has diff <= n.
For example, for country 'A', campaign 'abc' and diff 7, I need the count of distinct users from country 'A', campaign 'abc' with diff <= 7.
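To make the requirement concrete, here is a minimal sketch on a hand-made toy dataframe (the values are invented for illustration, and the column is named `campaign` to match the code further down rather than `campaign_id`). It computes the count for a single group and a single threshold:

```python
import pandas as pd

# Toy data, invented purely for illustration.
toy = pd.DataFrame({
    'country':  ['A', 'A', 'A', 'A', 'B'],
    'campaign': ['abc', 'abc', 'abc', 'abc', 'abc'],
    'diff':     [2, 7, 9, 5, 1],
    'user_id':  [1, 1, 2, 3, 4],
})

n = 7
# Keep only rows from country 'A', campaign 'abc' with diff <= n.
mask = (toy['country'] == 'A') & (toy['campaign'] == 'abc') & (toy['diff'] <= n)
unique_users = toy.loc[mask, 'user_id'].nunique()
print(unique_users)  # 2 (users 1 and 3; user 2 only has diff 9)
```

The goal is to get this number for every country/campaign group and every value of n at once.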
My current solution (below) takes too long to run:
import pandas as pd
import numpy as np

## generate test dataframe
df = pd.DataFrame({
    'country': np.random.choice(['A', 'B', 'C', 'D'], 10000),
    'campaign': np.random.choice(['camp1', 'camp2', 'camp3', 'camp4', 'camp5', 'camp6'], 10000),
    'diff': np.random.choice(range(10), 10000),
    'user_id': np.random.choice(range(1000), 10000)
})

## main
result_df = pd.DataFrame()
for diff in df['diff'].unique():
    tmp_df = df.loc[df['diff'] <= diff, :]
    tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).apply(lambda x: x.user_id.nunique()).reset_index()
    tmp_df['diff'] = diff
    tmp_df.columns = ['country', 'campaign', 'unique_ppl', 'diff']
    result_df = pd.concat([result_df, tmp_df], ignore_index=True, axis=0)
Maybe there is a better way to do this?
Recommended answer
First use a list comprehension with concat and assign to join all the filtered subsets together, tagging each one with its diff threshold. Then use groupby with nunique, including the new diff column in the grouping keys. Finally rename the columns and, if necessary, add reindex for a custom column order:
df1 = pd.concat([df.loc[df['diff'] <= x].assign(diff=x) for x in df['diff'].unique()])
df2 = (df1.groupby(['diff', 'country', 'campaign'], sort=False)['user_id']
          .nunique()
          .reset_index()
          .rename(columns={'user_id': 'unique_ppl'})
          .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))
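If the concatenated frame becomes large, a different technique may be worth trying. The sketch below is not the approach from the answer above. It relies on the observation that a user counts towards threshold n exactly when that user's smallest diff within the group is <= n, so the distinct counts can be obtained as a cumulative sum. The names `first_diff` and `thresholds`, the seeded `rng`, and the reduced campaign list are assumptions made for this example:

```python
import numpy as np
import pandas as pd

# Reproducible test data (seed chosen arbitrarily for this sketch).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'country': rng.choice(['A', 'B', 'C', 'D'], 10000),
    'campaign': rng.choice(['camp1', 'camp2', 'camp3'], 10000),
    'diff': rng.choice(np.arange(10), 10000),
    'user_id': rng.choice(np.arange(1000), 10000),
})

# Smallest diff at which each user appears within each country/campaign group.
first = (df.groupby(['country', 'campaign', 'user_id'])['diff']
           .min()
           .reset_index(name='first_diff'))

# Number of users whose smallest diff equals each value, one column per value.
counts = (first.groupby(['country', 'campaign', 'first_diff'])
               .size()
               .unstack('first_diff', fill_value=0))

# Make sure every threshold present in the data has a column, in sorted order.
thresholds = np.sort(df['diff'].unique())
counts = counts.reindex(columns=thresholds, fill_value=0).rename_axis(columns='diff')

# Cumulative sum across thresholds gives distinct users with diff <= n.
result = (counts.cumsum(axis=1)
                .reset_index()
                .melt(id_vars=['country', 'campaign'],
                      var_name='diff', value_name='unique_ppl'))
result['diff'] = result['diff'].astype(int)
```

One difference to be aware of: here a group with no users at or below a given threshold appears with unique_ppl equal to 0, whereas the concat-based version simply has no row for it.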