在 Pyspark/Hive 中使用条件运行总计 [英] Running total with conditional in Pyspark/Hive

查看:34
本文介绍了在 Pyspark/Hive 中使用条件运行总计的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有产品、品牌和百分比列.对于与当前行品牌不同的品牌以及与当前行品牌相同的品牌,我想计算当前行上方各行的百分比列的总和.如何在 PySpark 中或使用 spark.sql 执行此操作?

I have product, brand and percentage columns. I want to calculate the sum of the percentage column for the rows above the current row for those with different brand than the current row and also for those with same brand as the current row. How can I do it in PySpark or using using spark.sql?

示例数据:

df = pd.DataFrame({'a': ['a1','a2','a3','a4','a5','a6'],
              'brand':['b1','b2','b1', 'b3', 'b2','b1'],
          'pct': [40, 30, 10, 8,7,5]})
df = spark.createDataFrame(df)

我在寻找什么:

product  brand  pct  pct_same_brand  pct_different_brand
a1       b1     40     null           null
a2       b2     30     null           40
a3       b1     10     40             30
a4       b3     8      null           80
a5       b2     7      30             58
a6       b1     5      50             45                                       

这是我尝试过的:

df.createOrReplaceTempView('tmp')
spark.sql("""
select *, sum(pct * (select case when n1.brand = n2.brand then 1 else 0 end 
from tmp n1)) over(order by pct desc rows between UNBOUNDED PRECEDING and 1 
preceding) 
from tmp n2
""").show()

推荐答案

你可以通过分区滚动总和减去总滚动总和得到pct_different_brand列(即pct_same_brand> 列):

You can get the pct_different_brand column by subtracting the total rolling sum from the partitioned rolling sum (i.e. pct_same_brand column):

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'pct_same_brand', 
    F.sum('pct').over(
        Window.partitionBy('brand')
              .orderBy(F.desc('pct'))
              .rowsBetween(Window.unboundedPreceding, -1)
    )
).withColumn(
    'pct_different_brand', 
    F.sum('pct').over(
        Window.orderBy(F.desc('pct'))
              .rowsBetween(Window.unboundedPreceding, -1)
    ) - F.coalesce(F.col('pct_same_brand'), F.lit(0))
)

df2.show()

+---+-----+---+--------------+-------------------+
|  a|brand|pct|pct_same_brand|pct_different_brand|
+---+-----+---+--------------+-------------------+
| a1|   b1| 40|          null|               null|
| a2|   b2| 30|          null|                 40|
| a3|   b1| 10|            40|                 30|
| a4|   b3|  8|          null|                 80|
| a5|   b2|  7|            30|                 58|
| a6|   b1|  5|            50|                 45|
+---+-----+---+--------------+-------------------+

这篇关于在 Pyspark/Hive 中使用条件运行总计的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆