Getting a sum of events per different groups in pandas

Posted: 2021/6/13 20:20:56. Tags: python, pandas


Question

I have a structure like this:

country product     date_install    date_purchase   user_id
BR      yearly      2020-11-01      2020-11-01      10660236
CA      monthly     2020-11-01      2020-11-01      10649441
US      yearly      2020-11-01      trialed         10660272
IT      monthly     2020-11-01      2020-11-01      10657634
AE      monthly     2020-11-01      2020-11-01      10661442
IT      monthly     2020-11-01      trialed         10657634
AE      monthly     2020-11-01      trialed         10661442

I am trying to get the ratio of purchases to installs, along with the actual numbers of installs and purchases, per country, product and date. date_install is the installation date. date_purchase holds the day of the purchase, which also signals that a purchase happened; the value trialed in date_purchase means that the user with that user_id made no purchase.

The desired output should look like this:

country product     date_install        installs    purchases       ratio
US      daily       2021-02-05          100         20              0.2
US      monthly     2021-02-05          100         50              0.5
US      yearly      2021-02-05          100         50              0.5             
US      trialed     2021-02-05          100         0               0    
# the next day
US      daily       2021-02-06          500         50              0.1
US      monthly     2021-02-06          500         100             0.2
US      yearly      2021-02-06          500         250             0.5             
US      trialed     2021-02-06          500         0               0    
# the rest of the countries & the rest of the days

installs would be the total count of date_install for that day, country & product; purchases would be the total number of date_purchase events for each day, country & product.

The idea is that for a given country & day, X people have installed the app; some of them have purchased a product and some have not. Those who purchased have a date in date_purchase, whereas those who did not have the value trialed. The total number of people who installed the app is the count of date_install per country, product & date_install.

What I have tried:

exp = (df.groupby(['country','product','date_install']).count()
         .sort_values('date_install',ascending=False).reset_index())

exp.groupby(['country','product','date_install'])['date_purchase'].sum().reset_index()
exp['total_installs'] = exp.groupby(['country','product','date_install'])['date_purchase'].sum().reset_index()

But I get an error:

ValueError: Wrong number of items passed 4, placement implies 1

I also don't think the way I am trying to achieve this is the correct one. What would be the best way or logic to get the desired result?
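
A note on the error itself: the right-hand side of the failing assignment ends in reset_index(), so it is a DataFrame with four columns (the three group keys plus date_purchase), and pandas refuses to store that in the single column total_installs. The sketch below reproduces this on a few made-up rows. The sample data is an assumption for illustration only, and the exact wording of the message differs between pandas versions.

```python
import pandas as pd

# Made-up rows, only for illustration (not the asker's real data).
df = pd.DataFrame({
    "country": ["US", "US", "IT"],
    "product": ["yearly", "yearly", "monthly"],
    "date_install": ["2020-11-01", "2020-11-01", "2020-11-01"],
    "date_purchase": ["2020-11-01", "trialed", "2020-11-01"],
    "user_id": [1, 2, 3],
})

keys = ["country", "product", "date_install"]
exp = df.groupby(keys).count().reset_index()

# Right-hand side of the failing line: 3 key columns + date_purchase.
rhs = exp.groupby(keys)["date_purchase"].sum().reset_index()
print(rhs.shape[1])  # number of columns in the right-hand side

# Assigning a multi-column DataFrame to one new column fails.
try:
    exp["total_installs"] = rhs
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)
```

Note also that count() treats the string trialed as a regular non-missing value, so this approach would not separate purchases from non-purchases even without the error.
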

Update

After using @jezrael's answer:

import numpy as np

# mark 'trialed' as missing so that count() skips those rows
df['date_purchase'] = df['date_purchase'].replace('trialed', np.nan)

exp = (df.groupby(['country','product','date_install']).agg(installs = ('date_purchase','size'), purchases = ('date_purchase','count')))
exp['ratio'] = exp['purchases'].div(exp['installs'])
exp = exp.reset_index()


exp[(exp['date_install']=='2020-11-18') & (exp['country']=='US')]

returns:

country     product         date_install        installs    purchases   ratio
US          catalog30US     2020-11-18          1           1           1.0
US          trialed         2020-11-18          4924        0           0.0
US          renders.100     2020-11-18          2           2           1.0
US          renders.20      2020-11-18          3           3           1.0
US          monthly         2020-11-18          37          37          1.0
US          yearly          2020-11-18          6           6           1.0
US          textures        2020-11-18          1           1           1.0

This is not right, since installs in each row should be the sum of total installs for a given country & date_install.

In the output above, the installs value for each row needs to be the sum of all installs for that country & day. In this case every installs value should be 1+4924+2+3+37+6+1, which is the true number of installs for the given country & day, and then the ratios would make sense. Right now installs == purchases in every row except trialed, which is not correct. The question I am trying to answer is: for a given day and country, how many people installed, how many purchased each product, and what is the ratio?

I need it to be:

country     product         date_install        installs    purchases   ratio
US          catalog30US     2020-11-18          4974        1           1 / 4974
US          trialed         2020-11-18          4974        0           0.0
US          renders.100     2020-11-18          4974        2           2 / 4974
US          renders.20      2020-11-18          4974        3           3 / 4974
US          monthly         2020-11-18          4974        37          37 / 4974
US          yearly          2020-11-18          4974        6           6 / 4974
US          textures        2020-11-18          4974        1           1 / 4974
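
As a quick sanity check of the numbers in this table, the per-product counts from the earlier output can be summed by hand. The snippet below only re-types values already shown above; the variable names are illustrative.

```python
# Per-product row counts for US on 2020-11-18, copied from the earlier output.
rows_per_product = {
    "catalog30US": 1,
    "trialed": 4924,
    "renders.100": 2,
    "renders.20": 3,
    "monthly": 37,
    "yearly": 6,
    "textures": 1,
}
# Purchases per product, copied from the same output.
purchases = {
    "catalog30US": 1,
    "trialed": 0,
    "renders.100": 2,
    "renders.20": 3,
    "monthly": 37,
    "yearly": 6,
    "textures": 1,
}

# Installs for the country & day are the sum over all products.
total_installs = sum(rows_per_product.values())
ratios = {name: n / total_installs for name, n in purchases.items()}

print(total_installs)
for name, ratio in ratios.items():
    print(name, ratio)
```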


Recommended answer

I think you need to aggregate with GroupBy.size, which counts rows including missing values, and GroupBy.count, which counts only non-missing values, and then divide the columns:

import numpy as np

# 'trialed' means no purchase, so mark it as missing
df['date_purchase'] = df['date_purchase'].replace('trialed', np.nan)

# size counts all rows, count skips the missing ones
exp = (df.groupby(['country','product','date_install'])
         .agg(installs = ('date_purchase','size'), purchases = ('date_purchase','count')))

# sum per country and install date
exp['installs'] = exp.groupby(['country','date_install'])['installs'].transform('sum')
exp['ratio'] = exp['purchases'].div(exp['installs'])

exp = exp.reset_index()
print(exp)
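
To see the two counting methods and the transform step side by side, here is a small self-contained run of the same approach. The five rows are made up for illustration (they are not the asker's data): five installs in one country on one day, two of which led to a purchase.

```python
import numpy as np
import pandas as pd

# Made-up sample: 5 installs in US on one day, 2 of them purchased.
df = pd.DataFrame({
    "country": ["US"] * 5,
    "product": ["monthly", "yearly", "trialed", "trialed", "trialed"],
    "date_install": ["2020-11-18"] * 5,
    "date_purchase": ["2020-11-18", "2020-11-19",
                      "trialed", "trialed", "trialed"],
    "user_id": [1, 2, 3, 4, 5],
})

# 'trialed' becomes NaN, so count() ignores it while size() still counts the row.
df["date_purchase"] = df["date_purchase"].replace("trialed", np.nan)

exp = df.groupby(["country", "product", "date_install"]).agg(
    installs=("date_purchase", "size"),
    purchases=("date_purchase", "count"),
)

# Replace the per-product row count with the total per country and day.
# transform('sum') keeps one value per original row of exp.
exp["installs"] = (
    exp.groupby(["country", "date_install"])["installs"].transform("sum")
)
exp["ratio"] = exp["purchases"].div(exp["installs"])
exp = exp.reset_index()
print(exp)
```

Every row for the same country and day ends up with the same installs value (5 here), so each ratio is that product's share of all installs rather than always 1.0 or 0.0.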

