Compute median of column in pyspark

Problem description

I have a dataframe as shown below:

+-----------+------------+
|parsed_date|       count|
+-----------+------------+
| 2017-12-16|           2|
| 2017-12-16|           2|
| 2017-12-17|           2|
| 2017-12-17|           2|
| 2017-12-18|           1|
| 2017-12-19|           4|
| 2017-12-19|           4|
| 2017-12-19|           4|
| 2017-12-19|           4|
| 2017-12-20|           1|
+-----------+------------+

I want to compute the median of the entire 'count' column and add the result as a new column.

I tried:

median = df.approxQuantile('count',[0.5],0.1).alias('count_median')

But I am clearly doing something wrong, because it gives the following error:

AttributeError: 'list' object has no attribute 'alias'

Please help.

Recommended answer

approxQuantile returns a plain Python list of floats, not a Spark Column, so the result has no alias method. Take the first element of the list, wrap it in F.lit, and add it as a new column with withColumn:

import pyspark.sql.functions as F

# approxQuantile(col, probabilities, relativeError) returns a list; [0] picks the median
df2 = df.withColumn('count_median', F.lit(df.approxQuantile('count', [0.5], 0.1)[0]))

df2.show()
+-----------+-----+------------+
|parsed_date|count|count_median|
+-----------+-----+------------+
| 2017-12-16|    2|         2.0|
| 2017-12-16|    2|         2.0|
| 2017-12-17|    2|         2.0|
| 2017-12-17|    2|         2.0|
| 2017-12-18|    1|         2.0|
| 2017-12-19|    4|         2.0|
| 2017-12-19|    4|         2.0|
| 2017-12-19|    4|         2.0|
| 2017-12-19|    4|         2.0|
| 2017-12-20|    1|         2.0|
+-----------+-----+------------+

You can also use the approx_percentile / percentile_approx function in Spark SQL, applied as a window function over the whole DataFrame with over (). The third argument is the accuracy parameter:

import pyspark.sql.functions as F

# "over ()" is an empty window, so the median is taken across all rows
df2 = df.withColumn('count_median', F.expr("approx_percentile(count, 0.5, 10) over ()"))

df2.show()
+-----------+-----+------------+
|parsed_date|count|count_median|
+-----------+-----+------------+
| 2017-12-16|    2|           2|
| 2017-12-16|    2|           2|
| 2017-12-17|    2|           2|
| 2017-12-17|    2|           2|
| 2017-12-18|    1|           2|
| 2017-12-19|    4|           2|
| 2017-12-19|    4|           2|
| 2017-12-19|    4|           2|
| 2017-12-19|    4|           2|
| 2017-12-20|    1|           2|
+-----------+-----+------------+
