How to calculate the percentile of a column in a DataFrame in Spark?
Problem description
I am trying to calculate the percentile of a column in a DataFrame, but I can't find any percentile_approx function among Spark's aggregation functions.
For example, in Hive we have percentile_approx, and we can use it in the following way:
hiveContext.sql("select percentile_approx(Open_Rate, 0.10) from myTable")
But I want to do it using the Spark DataFrame API for performance reasons.
Sample dataset
| User ID | Open_Rate |
|---------|-----------|
| A1      | 10.3      |
| B1      | 4.04      |
| C1      | 21.7      |
| D1      | 18.6      |
I want to find out how many users fall into the 10th percentile, the 20th percentile, and so on. I want to do something like this:
df.select($"id",Percentile($"Open_Rate",0.1)).show
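The `Percentile` call above is hypothetical pseudocode; one concrete way to bucket users by percentile is Spark's built-in `ntile` window function. A minimal sketch, assuming the sample data above (the decile column name is my own choice):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.ntile

val spark = SparkSession.builder.appName("percentile-buckets").getOrCreate()
import spark.implicits._

// Recreate the sample dataset from the question
val df = Seq(("A1", 10.3), ("B1", 4.04), ("C1", 21.7), ("D1", 18.6))
  .toDF("User ID", "Open_Rate")

// ntile(10) splits the rows ordered by Open_Rate into 10 roughly equal
// buckets: decile 1 = lowest ~10%, decile 10 = highest ~10%.
val byRate = Window.orderBy($"Open_Rate")
df.withColumn("decile", ntile(10).over(byRate)).show()
```

Note that an `orderBy` window without a `partitionBy` moves all rows to a single partition, so this is fine for small data but can be a bottleneck at scale.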
Accepted answer
Since Spark 2.0, things have become easier: simply use this function from DataFrameStatFunctions, like so:
df.stat.approxQuantile("Open_Rate", Array(0.25, 0.50, 0.75), 0.0)
There are also other useful statistical functions for DataFrames in DataFrameStatFunctions.
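Putting it together with the sample data, a minimal end-to-end sketch might look like this (the third argument to `approxQuantile` is the relative error tolerance; 0.0 requests exact quantiles at the cost of more memory):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("approx-quantile").getOrCreate()
import spark.implicits._

// Recreate the sample dataset from the question
val df = Seq(("A1", 10.3), ("B1", 4.04), ("C1", 21.7), ("D1", 18.6))
  .toDF("User ID", "Open_Rate")

// Returns one Double per requested probability, in the same order:
// the approximate 25th, 50th, and 75th percentiles of Open_Rate.
val quantiles: Array[Double] =
  df.stat.approxQuantile("Open_Rate", Array(0.25, 0.50, 0.75), 0.0)
```

Unlike an aggregation expression, `approxQuantile` is an action that returns results to the driver, so it cannot be embedded directly inside a `select`.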