How to find exact median for grouped data in Spark

Problem description

I have a requirement to calculate the exact median on a grouped data set of Double datatype in Spark, using Scala.

It is different from the similar question Find median in Spark SQL for multiple double datatype columns. This question is about finding the median for grouped data, whereas the other one is about finding the median at the RDD level.

Here is my sample data:

scala> sqlContext.sql("select * from test").show()

+---+---+
| id|num|
+---+---+
|  A|0.0|
|  A|1.0|
|  A|1.0|
|  A|1.0|
|  A|0.0|
|  A|1.0|
|  B|0.0|
|  B|1.0|
|  B|1.0|
+---+---+

Expected answer:

+--------+
| Median |
+--------+
|   1    |
|   1    |
+--------+

I tried the following options, but with no luck:

1) Hive function percentile: it works only for BigInt.
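
For reference, a hypothetical sketch of that first attempt (based on the statement above that Hive's percentile UDAF accepts only integral types, calling it directly on the DOUBLE column is rejected):

scala> sqlContext.sql("select id, percentile(num, 0.5) from test group by id").show()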

2) Hive function percentile_approx: it does not work as expected (it returns 0.25 instead of 1).

scala> sqlContext.sql("select percentile_approx(num, 0.5) from test group by id").show()

+----+
| _c0|
+----+
|0.25|
|0.25|
+----+

Solution

Simplest Approach (requires Spark 2.0.1+; not an exact median)

As noted in the comments in reference to the first question, Find median in Spark SQL for double datatype columns, we can use percentile_approx to calculate the median on Spark 2.0.1+. To apply this to grouped data in Apache Spark, the query would look like:

val df = Seq(("A", 0.0), ("A", 0.0), ("A", 1.0), ("A", 1.0), ("A", 1.0), ("A", 1.0), ("B", 0.0), ("B", 1.0), ("B", 1.0)).toDF("id", "num")
df.createOrReplaceTempView("df")
spark.sql("select id, percentile_approx(num, 0.5) as median from df group by id order by id").show()

The output is:

+---+------+
| id|median|
+---+------+
|  A|   1.0|
|  B|   1.0|
+---+------+

That said, this is an approximate value (as opposed to the exact median asked for in the question).
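
If the approximation is too coarse, percentile_approx also accepts an optional third accuracy argument in recent Spark versions (worth checking against your version's documentation); higher values trade memory for a tighter estimate. A minimal sketch:

// 10000 is the documented default accuracy in Spark 2.x; raise it for a tighter approximation
spark.sql("select id, percentile_approx(num, 0.5, 10000) as median from df group by id order by id").show()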

There are multiple approaches, so I'm sure others on SO can provide better or more efficient examples. But here is a code snippet that calculates the median for grouped data in Spark (verified in Spark 1.6 and Spark 2.1):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

val rdd: RDD[(String, Double)] = sc.parallelize(Seq(("A", 1.0), ("A", 0.0), ("A", 1.0), ("A", 1.0), ("A", 0.0), ("A", 1.0), ("B", 0.0), ("B", 1.0), ("B", 1.0)))

// Scala median function (expects an already-sorted list)
def median(inputList: List[Double]): Double = {
  val count = inputList.size
  if (count % 2 == 0) {
    // even count: average the two middle values
    val l = count / 2 - 1
    val r = l + 1
    (inputList(l) + inputList(r)) / 2
  } else
    // odd count: take the middle value
    inputList(count / 2)
}

// Group the values by key and sort each group
val setRDD = rdd.groupByKey()
val sortedListRDD = setRDD.mapValues(_.toList.sorted)

// Output a DataFrame of id and median
sortedListRDD.map(m => {
  (m._1, median(m._2))
}).toDF("id", "median_of_num").show()

The output is:

+---+-------------+
| id|median_of_num|
+---+-------------+
|  A|          1.0|
|  B|          1.0|
+---+-------------+

There are some caveats that I should call out, as this likely isn't the most efficient implementation:

  • It's currently using groupByKey, which is not very performant. You may want to change this to reduceByKey instead (more information at Avoid GroupByKey); see the sketch after this list.
  • It uses a Scala function to calculate the median.
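
As a minimal sketch of the first caveat (reusing the rdd and the median() function defined above): wrap each value in a single-element list and concatenate the lists per key with reduceByKey. Because the exact median still needs every value per key, this still shuffles the full value lists, so the gain over groupByKey is limited here.

// Build the per-key value lists with reduceByKey instead of groupByKey
val mediansRDD = rdd
  .mapValues(List(_))                           // wrap each value in a single-element list
  .reduceByKey(_ ::: _)                         // concatenate lists per key
  .mapValues(values => median(values.sorted))   // sort and take the exact median

mediansRDD.toDF("id", "median_of_num").show()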

This approach should work okay for smaller amounts of data, but if you have millions of rows for each key, I would advise using Spark 2.0.1+ and the percentile_approx approach.
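
For reference, the same percentile_approx aggregation can also be expressed through the DataFrame API (a sketch on Spark 2.0.1+, assuming the df DataFrame created earlier):

import org.apache.spark.sql.functions.expr

// percentile_approx expressed as a SQL expression inside the DataFrame API
df.groupBy("id")
  .agg(expr("percentile_approx(num, 0.5)").as("median"))
  .orderBy("id")
  .show()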
