How do I conditionally aggregate values in the projection part of a pandas query?
Question
I currently have a CSV file with this content:
ID  PRODUCT_ID  NAME        STOCK  SELL_COUNT  DELIVERED_BY
1   P1          PRODUCT_P1  12     15          UPS
2   P2          PRODUCT_P2  4      3           DHL
3   P3          PRODUCT_P3  120    22          DHL
4   P1          PRODUCT_P1  423    18          UPS
5   P2          PRODUCT_P2  0      5           GLS
6   P3          PRODUCT_P3  53     10          DHL
7   P4          PRODUCT_P4  22     0           UPS
8   P1          PRODUCT_P1  94     56          GLS
9   P1          PRODUCT_P1  9      24          GLS
When I execute this SQL query:
SELECT
    PRODUCT_ID,
    MIN(CASE WHEN DELIVERED_BY = 'UPS' THEN STOCK END) AS STOCK,
    SUM(CASE WHEN ID > 6 THEN SELL_COUNT END) AS TOTAL_SELL_COUNT,
    SUM(CASE WHEN SELL_COUNT * 100 > 1000 THEN SELL_COUNT END) AS COND_SELL_COUNT
FROM products
GROUP BY PRODUCT_ID;
I get the expected result:
PRODUCT_ID  STOCK  TOTAL_SELL_COUNT  COND_SELL_COUNT
P1          12     80                113
P2          null   null              null
P3          null   null              22
P4          22     0                 null
Now I'm trying to get the same result on that dataset using pandas, and that's what I'm struggling with.
I imported the CSV file into a DataFrame called df_products. Then I tried this:
import numpy as np
import pandas as pd

def custom_aggregate(grouped):
    data = {
        'STOCK': np.where(grouped['DELIVERED_BY'] == 'UPS', grouped['STOCK'].min(), np.nan)  # [grouped['STOCK'].min() if grouped['DELIVERED_BY'] == 'UPS' else None]
    }
    d_series = pd.Series(data)
    return d_series

result = df_products.groupby('PRODUCT_ID').apply(custom_aggregate)
print(result)
As you can see, I'm nowhere near the expected result: I'm already having problems getting the conditional STOCK aggregation to work depending on the DELIVERED_BY values.
This outputs:
STOCK
PRODUCT_ID
P1 [9.0, 9.0, nan, nan]
P2 [nan, nan]
P3 [nan, nan]
P4 [22.0]
This is not even in the correct format, but I'd be happy if I could get the expected 12.0 instead of 9.0 for P1.
Thanks.
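A note on where the 9.0 comes from: in the attempt above, grouped['STOCK'].min() is evaluated over every row of the group before np.where applies the DELIVERED_BY condition, so the condition only decides which rows receive that already-computed minimum. The sketch below illustrates this on the four P1 rows copied from the sample data; the variable names are chosen here for illustration only.

```python
import pandas as pd

# The four P1 rows from the sample data (only the columns needed here).
p1 = pd.DataFrame({
    "STOCK": [12, 423, 94, 9],
    "DELIVERED_BY": ["UPS", "UPS", "GLS", "GLS"],
})

is_ups = p1["DELIVERED_BY"] == "UPS"

# What the attempt aggregates: the minimum over all four rows.
min_over_all_rows = p1["STOCK"].min()

# What the SQL CASE expression aggregates: the minimum over UPS rows only.
min_over_ups_rows = p1.loc[is_ups, "STOCK"].min()

print(min_over_all_rows, min_over_ups_rows)  # 9 12
```

Filtering the rows first and aggregating afterwards is what reproduces the SQL semantics.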
I just wanted to add that I got close to the result by creating additional columns:
df_products['COND_STOCK'] = df_products[df_products['DELIVERED_BY'] == 'UPS']['STOCK']
df_products['SELL_COUNT_ID_GT6'] = df_products[df_products['ID'] > 6]['SELL_COUNT']
df_products['SELL_COUNT_GT1000'] = df_products[(df_products['SELL_COUNT'] * 100) > 1000]['SELL_COUNT']
The function would then look like this:
def custom_aggregate(grouped):
    data = {
        'STOCK': grouped['COND_STOCK'].min(),
        'TOTAL_SELL_COUNT': grouped['SELL_COUNT_ID_GT6'].sum(),
        'COND_SELL_COUNT': grouped['SELL_COUNT_GT1000'].sum(),
    }
    d_series = pd.Series(data)
    return d_series

result = df_products.groupby('PRODUCT_ID').apply(custom_aggregate)
This is the 'almost' desired result (the sums show 0.0 where the SQL query returns null):
STOCK TOTAL_SELL_COUNT COND_SELL_COUNT
PRODUCT_ID
P1 12.0 80.0 113.0
P2 NaN 0.0 0.0
P3 NaN 0.0 22.0
P4 22.0 0.0 0.0
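The 0.0 values appear because Series.sum() skips NaN by default and returns 0.0 when nothing is left to add. A minimal sketch of the difference, using made-up Series that stand in for one group's helper column:

```python
import numpy as np
import pandas as pd

# Stand-in for a group where no row met the condition (e.g. P2's SELL_COUNT_ID_GT6).
all_missing = pd.Series([np.nan, np.nan])

default_sum = all_missing.sum()            # NaN values are skipped, so the result is 0.0
strict_sum = all_missing.sum(min_count=1)  # NaN, because fewer than 1 valid value exists

# Stand-in for a group with one matching row whose value is 0 (e.g. P4).
real_zero = pd.Series([0.0]).sum(min_count=1)  # stays 0.0, like SQL's SUM over a single 0

print(default_sum, strict_sum, real_zero)
```

With min_count=1, a group with no matching rows yields NaN (the counterpart of SQL's null), while a genuine sum of zero is kept.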
Recommended answer
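In pandas, this can usually be written as follows. Each condition is applied with .loc before aggregating, and sum(min_count=1) returns NaN rather than 0 when no row matches. Note that SELL_COUNT > 10 is equivalent to the SQL condition SELL_COUNT * 100 > 1000.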
df_products.groupby('PRODUCT_ID').apply(lambda x: pd.Series({
    'STOCK': x.loc[x.DELIVERED_BY == 'UPS', 'STOCK'].min(),
    # min_count=1 makes sum() return NaN instead of 0 when no row matches
    'TOTAL_SELL_COUNT': x.loc[x.ID > 6, 'SELL_COUNT'].sum(min_count=1),
    'COND_SELL_COUNT': x.loc[x.SELL_COUNT > 10, 'SELL_COUNT'].sum(min_count=1),
}))
Out[105]:
STOCK TOTAL_SELL_COUNT COND_SELL_COUNT
PRODUCT_ID
P1 12.0 80.0 113.0
P2 NaN NaN NaN
P3 NaN NaN 22.0
P4 22.0 0.0 NaN
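As a possible alternative to apply with a lambda, the conditions can be applied column-wise with Series.where before grouping, which turns non-matching rows into NaN much like a SQL CASE without ELSE. This is only a sketch under the assumption that the data looks like the sample in the question; the intermediate names masked and grouped are introduced here for illustration.

```python
import pandas as pd

# Sample data from the question (NAME column omitted, it is not used).
df_products = pd.DataFrame({
    "ID": range(1, 10),
    "PRODUCT_ID": ["P1", "P2", "P3", "P1", "P2", "P3", "P4", "P1", "P1"],
    "STOCK": [12, 4, 120, 423, 0, 53, 22, 94, 9],
    "SELL_COUNT": [15, 3, 22, 18, 5, 10, 0, 56, 24],
    "DELIVERED_BY": ["UPS", "DHL", "DHL", "UPS", "GLS", "DHL", "UPS", "GLS", "GLS"],
})

# Series.where keeps values where the condition holds and sets the rest to NaN.
masked = pd.DataFrame({
    "PRODUCT_ID": df_products["PRODUCT_ID"],
    "STOCK": df_products["STOCK"].where(df_products["DELIVERED_BY"] == "UPS"),
    "TOTAL_SELL_COUNT": df_products["SELL_COUNT"].where(df_products["ID"] > 6),
    "COND_SELL_COUNT": df_products["SELL_COUNT"].where(df_products["SELL_COUNT"] * 100 > 1000),
})

grouped = masked.groupby("PRODUCT_ID")

# min() ignores NaN; sum(min_count=1) yields NaN for groups with no matching row.
result = pd.DataFrame({
    "STOCK": grouped["STOCK"].min(),
    "TOTAL_SELL_COUNT": grouped["TOTAL_SELL_COUNT"].sum(min_count=1),
    "COND_SELL_COUNT": grouped["COND_SELL_COUNT"].sum(min_count=1),
})

print(result)
```

This variant avoids calling a Python function once per group, which may matter on larger data, but the apply version above stays closer to the shape of the SQL query.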