Calculate quantile on grouped data in Spark Dataframe


Question


I have the following Spark DataFrame:

+--------+--------------+
|agent_id|payment_amount|
+--------+--------------+
|       a|          1000|
|       b|          1100|
|       a|          1100|
|       a|          1200|
|       b|          1200|
|       b|          1250|
|       a|         10000|
|       b|          9000|
+--------+--------------+

My desired output would be something like:

agent_id   95_quantile
   a          whatever the 95th quantile is for agent a's payments
   b          whatever the 95th quantile is for agent b's payments

For each agent_id group I need to calculate the 0.95 quantile, so I take the following approach:

test_df.groupby('agent_id').approxQuantile('payment_amount',0.95)

but I get the following error:

'GroupedData' object has no attribute 'approxQuantile'

I need to have the 0.95 quantile (percentile) in a new column so it can later be used for filtering purposes.

I am using Spark 2.0.0

Solution

One solution would be to use percentile_approx:

>>> test_df.registerTempTable("df")
>>> df2 = sqlContext.sql("select agent_id, percentile_approx(payment_amount,0.95) as approxQuantile from df group by agent_id")

>>> df2.show()
# +--------+-----------------+
# |agent_id|   approxQuantile|
# +--------+-----------------+
# |       a|8239.999999999998|
# |       b|7449.999999999998|
# +--------+-----------------+ 
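
The same aggregation can also be expressed through the DataFrame API instead of raw SQL, by wrapping the SQL function in expr. This is a minimal sketch, assuming a Spark version where percentile_approx is registered as a SQL function (natively in recent releases, via Hive support in 1.6):

>>> from pyspark.sql import functions as F
>>> # expr() lets the SQL function percentile_approx be used inside agg()
>>> df2 = test_df.groupBy('agent_id').agg(
...     F.expr('percentile_approx(payment_amount, 0.95)').alias('approxQuantile'))
>>> df2.show()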

Note 1: This solution was tested with Spark 1.6.2 and requires a HiveContext.

Note 2: approxQuantile isn't available for PySpark in Spark < 2.0, and it is defined on DataFrame rather than GroupedData, which is why the grouped call in the question fails.

Note 3: percentile_approx returns an approximate pth percentile of a numeric column (including floating point types) in the group. When the number of distinct values in col is smaller than the optional accuracy argument (the third parameter), it gives an exact percentile value.
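
For instance, the accuracy argument can be passed explicitly; this sketch assumes the three-argument Hive form, where 10000 is the documented default:

>>> sqlContext.sql("select agent_id, percentile_approx(payment_amount, 0.95, 10000) as approxQuantile from df group by agent_id").show()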

EDIT: As of Spark 2+, a HiveContext is not required.
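
Since the quantile is needed in a new column for later filtering, one way to get it there (a sketch building on the query above, not part of the original answer) is to join the per-group result back onto the source rows:

>>> quantiles = sqlContext.sql(
...     "select agent_id, percentile_approx(payment_amount, 0.95) as p95 "
...     "from df group by agent_id")
>>> # each row now carries its group's 0.95 quantile; keep rows at or below it
>>> test_df.join(quantiles, on='agent_id').where('payment_amount <= p95').show()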
