Scala - First quartile, third quartile, and IQR from Spark SQLContext dataframe without Hive


Question

I have a dataframe:

data.show()
+--------+------+------------------+
|   Count|  mean|             stdev|
+--------+------+------------------+
|       5|  6337| 1684.569470220803|
|       3|  7224| 567.8250904401182|
|     330| 20280|23954.260831863092|
|      42| 26586|  32957.9072313323|
...
|      49| 23422|21244.094701798418|
|       4| 36949| 8616.596311769514|
|      35| 20915|14971.559603562522|
|      33| 20874|16657.756963894684|
|      14| 22698|15416.614921307082|
|      25| 19100| 12342.11627585264|
|      27| 21879|21363.736895687238|
+--------+------+------------------+

Without using Hive, I want to get the first quartile, the third quartile, and the IQR (interquartile range) for the column "mean".

Other solutions seem to use Hive, which not everyone has access to.

Hive Solution 1

Hive Solution 2

Python Solution

Recommended answer

I'd like to note first that this seems to be a fairly expensive solution, since it collects the whole column to the driver, but it gives me exactly what I want without using Hive. If you are able to use Hive, definitely do so, because it couldn't be any easier.

I ended up using the commons-math3 jar. The trick to using it was getting the data out of the dataframe and into an array for consumption by the math3 library. I worked that out from HERE. You may have to adjust the "asInstanceOf" cast to match the data type of the column.

import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics

// Turn the dataframe column into an Array[Long]
// (adjust the cast if the column is not a Long)
val mean = data.select("mean").rdd.map(row => row(0).asInstanceOf[Long]).collect()

// Create the math3 object and add each value from the
// mean array to the descriptive statistics object
val arrMean = new DescriptiveStatistics()
mean.foreach(v => arrMean.addValue(v.toDouble))

// Get the first and third quartiles, then calculate the IQR
val meanQ1 = arrMean.getPercentile(25)
val meanQ3 = arrMean.getPercentile(75)
val meanIQR = meanQ3 - meanQ1
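
To make it clearer what a percentile call like the one above computes once the values are in a local array, here is a minimal dependency-free sketch. It assumes the estimate places the p-th percentile at position p * (n + 1) / 100 in the sorted data and interpolates linearly between neighbours, which I believe matches the default behaviour documented for commons-math3, though other estimation types give slightly different values. The names QuartileSketch, percentile and iqr are hypothetical, and the sample values are just the first four "mean" entries from the table in the question.

```scala
object QuartileSketch {
  // Hypothetical helper: percentile estimate at position p * (n + 1) / 100
  // in the sorted data, with linear interpolation between neighbours.
  // Assumed (not verified here) to mirror the commons-math3 default.
  def percentile(values: Seq[Double], p: Double): Double = {
    require(values.nonEmpty, "values must not be empty")
    require(p > 0 && p <= 100, "p must be in (0, 100]")
    val sorted = values.sorted.toIndexedSeq
    val n = sorted.length
    val pos = p * (n + 1) / 100.0
    val floorPos = math.floor(pos)
    if (pos < 1) sorted.head            // below the first position: smallest value
    else if (pos >= n) sorted.last      // at or past the last position: largest value
    else {
      val lower = sorted(floorPos.toInt - 1) // positions are 1-based
      val upper = sorted(floorPos.toInt)
      lower + (pos - floorPos) * (upper - lower)
    }
  }

  // Interquartile range: third quartile minus first quartile
  def iqr(values: Seq[Double]): Double =
    percentile(values, 75) - percentile(values, 25)

  def main(args: Array[String]): Unit = {
    // First four "mean" values from the question's table
    val mean = Seq(6337L, 7224L, 20280L, 26586L).map(_.toDouble)
    println(s"Q1  = ${percentile(mean, 25)}") // 6558.75
    println(s"Q3  = ${percentile(mean, 75)}") // 25009.5
    println(s"IQR = ${iqr(mean)}")            // 18450.75
  }
}
```

With only four values, Q1 falls a quarter of the way between 6337 and 7224, and Q3 falls three quarters of the way between 20280 and 26586. Like the answer above, this works on data already collected to the driver, so the cost concern for large columns still applies.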
