Calculate quantile on grouped data in Spark DataFrame
I have the following Spark dataframe :
agent_id|payment_amount|
+--------+--------------+
| a| 1000|
| b| 1100|
| a| 1100|
| a| 1200|
| b| 1200|
| b| 1250|
| a| 10000|
| b| 9000|
+--------+--------------+
My desired output would be something like:
agent_id 95_quantile
a        whatever the 95th quantile is for agent a's payments
b        whatever the 95th quantile is for agent b's payments
For each agent_id group I need to calculate the 0.95 quantile, so I took the following approach:
test_df.groupby('agent_id').approxQuantile('payment_amount',0.95)
but I get the following error:
'GroupedData' object has no attribute 'approxQuantile'
I need to have the .95 quantile (percentile) in a new column so it can later be used for filtering purposes.
I am using Spark 2.0.0
One solution would be to use percentile_approx:
>>> test_df.registerTempTable("df")
>>> df2 = sqlContext.sql("select agent_id, percentile_approx(payment_amount,0.95) as approxQuantile from df group by agent_id")
>>> df2.show()
# +--------+-----------------+
# |agent_id| approxQuantile|
# +--------+-----------------+
# | a|8239.999999999998|
# | b|7449.999999999998|
# +--------+-----------------+
Note 1: This solution was tested with Spark 1.6.2 and requires a HiveContext.
Note 2: approxQuantile isn't available in Spark < 2.0 for pyspark, and even in Spark 2.0+ it is a method on DataFrame, not on GroupedData, which is why the call above fails.
Note 3: percentile_approx returns an approximate pth percentile of a numeric column (including floating point types) in the group. When the number of distinct values in col is smaller than the second argument value, this gives an exact percentile value.
EDIT: From Spark 2+, HiveContext is not required.