Calculate the standard deviation of grouped data in a Spark DataFrame
Question
I have user logs that I have taken from a csv and converted into a DataFrame in order to leverage the SparkSQL querying features. A single user will create numerous entries per hour, and I would like to gather some basic statistical information for each user; really just the count of the user instances, the average, and the standard deviation of numerous columns. I was able to quickly get the mean and count information by using groupBy($"user") and the aggregator with SparkSQL functions for count and avg:
val meanData = selectedData.groupBy($"user").agg(count($"logOn"),
avg($"transaction"), avg($"submit"), avg($"submitsPerHour"), avg($"replies"),
avg($"repliesPerHour"), avg($"duration"))
However, I cannot seem to find an equally elegant way to calculate the standard deviation. So far I can only calculate it by mapping to a (String, Double) pair and using the StatCounter().stdev utility:
val stdevduration = duration.groupByKey().mapValues(value =>
org.apache.spark.util.StatCounter(value).stdev)
However, this returns an RDD, and I would like to keep everything in a DataFrame so that further queries are possible on the returned data.
Answer
Edit
In Spark >= 1.6 you can use stddev_pop to compute the population standard deviation and stddev / stddev_samp to compute the unbiased sample standard deviation:
import org.apache.spark.sql.functions.{stddev_samp, stddev_pop}

selectedData.groupBy($"user").agg(stddev_pop($"duration"))
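The difference between the two functions is Bessel's correction: stddev_pop divides the summed squared deviations by n, while stddev / stddev_samp divide by n - 1. A minimal plain-Scala sketch of the two formulas (no Spark needed; the sample values are made up for illustration):

```scala
// Population vs. sample standard deviation, illustrating what
// stddev_pop and stddev_samp compute (plain Scala, no Spark).
def stddevPop(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.size
  math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / xs.size)
}

def stddevSamp(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.size
  math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1))
}

val durations = Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
println(stddevPop(durations))   // 2.0
println(stddevSamp(durations))  // ≈ 2.138
```

For large groups the two values converge, so the choice mostly matters for users with few log entries.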
Original answer (Spark 1.5 and below):
Not so pretty and biased (same as the value returned from describe), but using the formula you can do something like this:
import org.apache.spark.sql.functions.{avg, sqrt}

selectedData
  .groupBy($"user")
  .agg(sqrt(
    avg($"duration" * $"duration") -
    avg($"duration") * avg($"duration")
  ).alias("duration_sd"))
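The aggregation above relies on the identity Var(X) = E[X²] - E[X]², i.e. sd = sqrt(avg(x * x) - avg(x) * avg(x)). A quick plain-Scala check of that identity against the direct definition (the values are made up for illustration):

```scala
// Verifies that sqrt(E[X^2] - E[X]^2) matches the direct population
// standard deviation (plain Scala, no Spark).
val xs = Seq(1.0, 2.0, 3.0, 4.0, 5.0)
val n = xs.size
val mean = xs.sum / n
val meanOfSquares = xs.map(x => x * x).sum / n

val viaIdentity = math.sqrt(meanOfSquares - mean * mean)
val direct = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / n)
// both ≈ 1.4142 for this data
```

Note that both sides compute the population (biased) standard deviation, which is why this matches describe rather than stddev_samp.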
You can of course create a function to reduce the clutter:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{avg, sqrt}

def mySd(col: Column): Column = {
  sqrt(avg(col * col) - avg(col) * avg(col))
}
df.groupBy($"user").agg(mySd($"duration").alias("duration_sd"))
It is also possible to use a Hive UDF:
df.registerTempTable("df")
sqlContext.sql("""SELECT user, stddev(duration)
FROM df
GROUP BY user""")
Credit: https://en.wikipedia.org/wiki/Standard_deviation