Running total with conditional in Pyspark/Hive
Question
I have product, brand and percentage columns. For each row, I want to calculate the sum of the percentage column over the rows above it, split into rows with a different brand than the current row and rows with the same brand as the current row. How can I do this in PySpark or using spark.sql?
Sample data:
import pandas as pd

df = pd.DataFrame({'a': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6'],
                   'brand': ['b1', 'b2', 'b1', 'b3', 'b2', 'b1'],
                   'pct': [40, 30, 10, 8, 7, 5]})
df = spark.createDataFrame(df)
What I'm looking for:
product brand pct pct_same_brand pct_different_brand
a1 b1 40 null null
a2 b2 30 null 40
a3 b1 10 40 30
a4 b3 8 null 80
a5 b2 7 30 58
a6 b1 5 50 45
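For reference, the expected values above can be reproduced with plain pandas as a sanity check. This is just a sketch of the desired semantics (running sums over preceding rows, split by brand match), not the Spark solution being asked for:

```python
import pandas as pd

df = pd.DataFrame({'a': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6'],
                   'brand': ['b1', 'b2', 'b1', 'b3', 'b2', 'b1'],
                   'pct': [40, 30, 10, 8, 7, 5]})

# Rows are already sorted by pct descending, matching the intended window order.
same = df.groupby('brand')['pct'].cumsum() - df['pct']   # sum of preceding same-brand rows
total = df['pct'].cumsum() - df['pct']                   # sum of all preceding rows
same_count = df.groupby('brand').cumcount()              # count of preceding same-brand rows
pos = pd.Series(range(len(df)))                          # count of preceding rows overall

df['pct_same_brand'] = same.where(same_count > 0)        # null when no same-brand row above
df['pct_different_brand'] = (total - same).where(pos - same_count > 0)
print(df)
```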
Here is what I tried:
df.createOrReplaceTempView('tmp')
spark.sql("""
select *, sum(pct * (select case when n1.brand = n2.brand then 1 else 0 end
from tmp n1)) over(order by pct desc rows between UNBOUNDED PRECEDING and 1
preceding)
from tmp n2
""").show()
Answer
You can get the pct_different_brand column by subtracting the partitioned rolling sum (i.e. the pct_same_brand column) from the total rolling sum:
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'pct_same_brand',
    F.sum('pct').over(
        Window.partitionBy('brand')
        .orderBy(F.desc('pct'))
        .rowsBetween(Window.unboundedPreceding, -1)
    )
).withColumn(
    'pct_different_brand',
    F.sum('pct').over(
        Window.orderBy(F.desc('pct'))
        .rowsBetween(Window.unboundedPreceding, -1)
    ) - F.coalesce(F.col('pct_same_brand'), F.lit(0))
)
df2.show()
+---+-----+---+--------------+-------------------+
| a|brand|pct|pct_same_brand|pct_different_brand|
+---+-----+---+--------------+-------------------+
| a1| b1| 40| null| null|
| a2| b2| 30| null| 40|
| a3| b1| 10| 40| 30|
| a4| b3| 8| null| 80|
| a5| b2| 7| 30| 58|
| a6| b1| 5| 50| 45|
+---+-----+---+--------------+-------------------+
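Since the question also asks for a spark.sql option, the same window logic can be written as a single query against the tmp view registered earlier (spark.sql(query)). The sketch below runs that query through sqlite3, whose window-frame syntax matches, purely to illustrate the SQL; in Spark you would execute the same string with spark.sql:

```python
import sqlite3

# Same logic as the DataFrame version: a per-brand running sum, and the overall
# running sum minus it (coalesced to 0 when there is no preceding same-brand row).
query = """
select a, brand, pct,
       sum(pct) over (partition by brand order by pct desc
                      rows between unbounded preceding and 1 preceding) as pct_same_brand,
       sum(pct) over (order by pct desc
                      rows between unbounded preceding and 1 preceding)
         - coalesce(sum(pct) over (partition by brand order by pct desc
                                   rows between unbounded preceding and 1 preceding), 0)
         as pct_different_brand
from tmp
order by pct desc
"""

con = sqlite3.connect(':memory:')
con.execute("create table tmp (a text, brand text, pct int)")
con.executemany("insert into tmp values (?, ?, ?)",
                [('a1', 'b1', 40), ('a2', 'b2', 30), ('a3', 'b1', 10),
                 ('a4', 'b3', 8), ('a5', 'b2', 7), ('a6', 'b1', 5)])
rows = con.execute(query).fetchall()
```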