PySpark: weighted average by a column

Question

For example, I have a dataset like this:

test = spark.createDataFrame([
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (3, 3, 1, "2018-06-01", "Region A"),
    (3, 1, 3, "2018-06-05", "Region A"),
])\
  .toDF("orderid", "customerid", "price", "transactiondate", "location")
test.show()

I can compute each customer's share of orders per region with:

from pyspark.sql.functions import col, count

# Total number of orders per customer.
overall_stat = test.groupBy("customerid").agg(count("orderid"))\
  .withColumnRenamed("count(orderid)", "overall_count")
# Orders per customer and region; missing combinations become 0.
temp_result = test.groupBy("customerid").pivot("location").agg(count("orderid")).na.fill(0).join(overall_stat, ["customerid"])

# Convert each region count into a share of the customer's total orders.
for field in temp_result.schema.fields:
    if str(field.name) not in ["customerid", "overall_count"]:
        name = str(field.name)
        temp_result = temp_result.withColumn(name, col(name)/col("overall_count"))
temp_result.show()

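The result has one row per customer: the share of that customer's orders in each region, plus overall_count.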

Now I want to calculate the average of each region column weighted by overall_count. How can I do it?

The result should be (0.66*3+1*1)/4 for Region A, and (0.33*3+1*1)/4 for Region B.

My thoughts:

It can certainly be done by converting the data to Python/pandas and doing the calculation there, but then in what cases should we use PySpark?

I can get something like this:

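from pyspark.sql.functions import col, sum as spark_sum  # alias avoids shadowing the built-in sum
temp_result.agg(spark_sum(col("Region A") * col("overall_count")), spark_sum(col("Region B") * col("overall_count"))).show()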

but it doesn't feel right, especially if there are many regions to compute.

Recommended answer

You can compute a weighted average by breaking your steps above into multiple stages.

Consider the following:

Dataframe Name: sales_table
[ total_sales, count_of_orders, location]
[     50     ,       9        ,    A    ]
[     80     ,       4        ,    A    ]
[     90     ,       7        ,    A    ]

Calculating the grouped weighted average of the above (70) breaks down into three steps:

  1. Multiply each sales value by its weight (here count_of_orders), giving sales_x_count
  2. Aggregate (sum) the sales_x_count products within each group
  3. Divide the summed sales_x_count by the sum of the original weights
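As a quick sanity check on the figure of 70, the three steps can be traced in plain Python on the sample rows. This is only an illustrative sketch; the list name `rows` and the intermediate variable names are assumptions, not part of the original answer.

```python
# Sample rows from the table above: (total_sales, count_of_orders, location).
rows = [(50, 9, "A"), (80, 4, "A"), (90, 7, "A")]

# Step 1: multiply each sales value by its weight.
sales_x_count = [sales * count for sales, count, _ in rows]
# Step 2: aggregate the products and the weights.
sum_sales_x_count = sum(sales_x_count)
sum_count = sum(count for _, count, _ in rows)
# Step 3: divide the summed products by the summed weights.
weighted_average = sum_sales_x_count / sum_count
print(weighted_average)  # 70.0
```

The divisor is the sum of the weights (9 + 4 + 7 = 20), not the sum of the sales values.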

Breaking the above into stages in PySpark code gives the following:

from pyspark.sql import functions as sf
from pyspark.sql.functions import col

new_sales = sales_table \
    .withColumn("sales_x_count", col("total_sales") * col("count_of_orders")) \
    .groupBy("location") \
    .agg(sf.sum("count_of_orders").alias("sum_count_of_orders"), \
         sf.sum("sales_x_count").alias("sum_sales_x_count")) \
    .withColumn("count_weighted_average",  # divide by the summed weights, not the summed sales
                col("sum_sales_x_count") / col("sum_count_of_orders"))

So no fancy UDF is really necessary here (and one would likely slow you down).
