approxQuantile gives an incorrect median in Spark (Scala)?


Question

I have the following test data:

      val data = List(
        List(47.5335D),
        List(67.5335D),
        List(69.5335D),
        List(444.1235D),
        List(677.5335D)
      )

I expect the median to be 69.5335. But when I try to find the exact median with this code:

df.stat.approxQuantile(column, Array(0.5), 0)

it gives me 444.1235.

Why is this so, and how can it be fixed?

This is what I am doing:

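      import org.apache.spark.sql.Row
      import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

      val data = List(
        List(47.5335D),
        List(67.5335D),
        List(69.5335D),
        List(444.1235D),
        List(677.5335D)
      )

      // Build a single-column DataFrame of non-nullable doubles
      val rdd = sparkContext.parallelize(data).map(Row.fromSeq(_))
      val schema = StructType(Array(
        StructField("value", DataTypes.DoubleType, false)
      ))

      val df = sqlContext.createDataFrame(rdd, schema)
      df.createOrReplaceTempView(tableName)

      // Query the temp view, then request the 0.5 quantile with relativeError = 0
      val df2 = sqlContext.sql(s"SELECT value FROM $tableName")
      val median = df2.stat.approxQuantile("value", Array(0.5), 0)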

So I'm creating a temp table, querying it, and then calculating the result. It's just for testing.

Recommended answer

Note that this is an approximate quantile computation. It is not supposed to give you the exact answer all the time. See here for a more thorough explanation.

The reason is that for very large datasets, an approximate answer is sometimes acceptable, as long as you get it significantly faster than the exact computation would take.
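For a dataset as small as the one in the question, the exact median can simply be computed by sorting the values. The following is a minimal sketch in plain Scala, not something from the original answer: `ExactMedian` and `exactMedian` are hypothetical names, and the sketch assumes the values already sit in a local collection (for a DataFrame they would first have to be brought to the driver, which is only reasonable for small data).

```scala
// Minimal sketch: exact median of a small in-memory collection.
// ExactMedian / exactMedian are illustrative names, not a Spark API.
object ExactMedian {

  /** Exact median: the middle value for an odd count,
    * the mean of the two middle values for an even count. */
  def exactMedian(values: Seq[Double]): Double = {
    require(values.nonEmpty, "median of an empty collection is undefined")
    val sorted = values.sorted
    val n = sorted.length
    if (n % 2 == 1) sorted(n / 2)
    else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
  }

  def main(args: Array[String]): Unit = {
    // Same five values as in the question
    val data = Seq(47.5335, 67.5335, 69.5335, 444.1235, 677.5335)
    println(exactMedian(data)) // prints 69.5335
  }
}
```

Sorting costs O(n log n) and needs all values in one place, which is exactly the cost an approximate quantile algorithm tries to avoid on large distributed data.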
