How to find the median in Apache Spark with Python Dataframe API?

Question

The PySpark API provides many aggregate functions, but not the median. Spark 2 comes with approxQuantile, which gives approximate quantiles, but an exact median is very expensive to calculate. Is there a more idiomatic PySpark way of calculating the median for a column of values in a Spark DataFrame?

Recommended answer

Here is an example implementation using the DataFrame API in Python (Spark 1.6+).

import pyspark.sql.functions as F
import numpy as np
from pyspark.sql.types import FloatType

Let's assume we have monthly salaries for customers in a Spark DataFrame named "salaries", with columns such as:

month | customer_id | salary

We would like to find the median salary per customer across all months.

Step 1: Write a user-defined function (UDF) to calculate the median

def find_median(values_list):
    try:
        median = np.median(values_list) #get the median of values in a list in each row
        return round(float(median),2)
    except Exception:
        return None #if there is anything wrong with the given values

median_finder = F.udf(find_median,FloatType())
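
As a quick sanity check of the helper's logic outside Spark, the sketch below reimplements it with the standard library's statistics.median instead of NumPy. That substitution is an assumption made here so the snippet runs without extra dependencies (the answer itself uses np.median), and the name find_median_stdlib is hypothetical.

```python
import statistics


def find_median_stdlib(values_list):
    """Return the median of values_list rounded to 2 decimals, or None on bad input."""
    try:
        return round(float(statistics.median(values_list)), 2)
    except (statistics.StatisticsError, TypeError, ValueError):
        # Empty list, None, or non-numeric entries
        return None


print(find_median_stdlib([3000.0, 3200.0, 2800.0]))  # 3000.0
print(find_median_stdlib([1000.0, 2000.0]))          # 1500.0
print(find_median_stdlib([]))                        # None
```

Returning None for bad input mirrors the try/except in find_median, so a malformed group would produce a null in the result column rather than failing the whole job.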

Step 2: Group by customer and collect each customer's salary values into a list, one row per customer:

salaries_list = salaries.groupBy("customer_id").agg(F.collect_list("salary").alias("salaries"))

Step 3: Call the median_finder UDF on the salaries column and add the median values as a new column

salaries_list = salaries_list.withColumn("median",median_finder("salaries")) 
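
To make explicit what the three steps compute, here is a plain-Python sketch that runs without Spark. The sample rows and the function name median_per_customer are made up for illustration, and statistics.median stands in for np.median. The dictionary of lists plays the role of groupBy plus collect_list, and the final comprehension plays the role of applying the UDF with withColumn.

```python
from collections import defaultdict
import statistics

# Made-up sample rows: (month, customer_id, salary)
rows = [
    ("2020-01", "a", 3000.0),
    ("2020-02", "a", 3200.0),
    ("2020-03", "a", 2800.0),
    ("2020-01", "b", 1000.0),
    ("2020-02", "b", 2000.0),
]


def median_per_customer(rows):
    # Step 2 equivalent: gather each customer's salaries into one list
    salaries_by_customer = defaultdict(list)
    for _month, customer_id, salary in rows:
        salaries_by_customer[customer_id].append(salary)
    # Steps 1 and 3 equivalent: compute a rounded median for every list
    return {
        customer_id: round(float(statistics.median(salaries)), 2)
        for customer_id, salaries in salaries_by_customer.items()
    }


result = median_per_customer(rows)
print(result)  # {'a': 3000.0, 'b': 1500.0}
```

Note that this approach holds all salaries of one customer in memory at once, which is fine for a few rows per customer but can become costly for very large groups.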
