How to calculate the percentile of a column in a DataFrame in Spark?
Problem description
I am trying to calculate the percentile of a column in a DataFrame, but I can't find any percentile_approx function among Spark's aggregation functions.
For example, in Hive we have percentile_approx, and we can use it in the following way:
hiveContext.sql("select percentile_approx(Open_Rate, 0.10) from myTable")
But I want to do it using the Spark DataFrame API for performance reasons.
Sample dataset
| User ID | Open_Rate |
|---------|-----------|
| A1      | 10.3      |
| B1      | 4.04      |
| C1      | 21.7      |
| D1      | 18.6      |
I want to find out how many users fall into the 10th percentile, the 20th percentile, and so on. I want to do something like this:
df.select($"id",Percentile($"Open_Rate",0.1)).show
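The `Percentile` call above is hypothetical pseudocode; one concrete way to bucket users by percentile is Spark's built-in `ntile` window function. A minimal sketch, assuming the sample data above (the decile column name is my own choice):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.ntile

val spark = SparkSession.builder.appName("percentile-buckets").getOrCreate()
import spark.implicits._

// Recreate the sample dataset from the question
val df = Seq(("A1", 10.3), ("B1", 4.04), ("C1", 21.7), ("D1", 18.6))
  .toDF("User ID", "Open_Rate")

// ntile(10) splits the rows ordered by Open_Rate into 10 roughly equal
// buckets: decile 1 = lowest ~10%, decile 10 = highest ~10%.
val byRate = Window.orderBy($"Open_Rate")
df.withColumn("decile", ntile(10).over(byRate)).show()
```

Note that an `orderBy` window without a `partitionBy` moves all rows to a single partition, so this is fine for small data but can be a bottleneck at scale.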
Accepted answer
Since Spark 2.0, things have become easier: simply use this function from DataFrameStatFunctions, like so:
df.stat.approxQuantile("Open_Rate", Array(0.25, 0.50, 0.75), 0.0)
There are also other useful statistical functions for DataFrames in DataFrameStatFunctions.
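Putting it together with the sample data, a minimal end-to-end sketch might look like this (the third argument to `approxQuantile` is the relative error tolerance; 0.0 requests exact quantiles at the cost of more memory):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("approx-quantile").getOrCreate()
import spark.implicits._

// Recreate the sample dataset from the question
val df = Seq(("A1", 10.3), ("B1", 4.04), ("C1", 21.7), ("D1", 18.6))
  .toDF("User ID", "Open_Rate")

// Returns one Double per requested probability, in the same order:
// the approximate 25th, 50th, and 75th percentiles of Open_Rate.
val quantiles: Array[Double] =
  df.stat.approxQuantile("Open_Rate", Array(0.25, 0.50, 0.75), 0.0)
```

Unlike an aggregation expression, `approxQuantile` is an action that returns results to the driver, so it cannot be embedded directly inside a `select`.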