How to normalize an array column in a dataframe

Views: 52 | Posted: 2020/9/4 21:55:53 | Tags: apache-spark dataframe spark-dataframe


Question

I'm using Spark 2.2 and I want to normalize each value in a fixed-size array by dividing it by the array's maximum.

Input

{"values": [1,2,3,4]}

Output

{"values": [0.25, 0.5, 0.75, 1] }

Currently, I'm using a udf:

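import org.apache.spark.sql.functions.udf

// Divides every element by the array's maximum and returns a new sequence.
val f = udf { (l: Seq[Double]) =>
  val max = l.max
  l.map(_ / max)
}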

Is there a way to avoid the udf (and the associated performance penalty)?

Recommended answer

I've come up with an optimized version of my udf, which updates the array in place instead of allocating a new one.

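  import scala.collection.mutable
  import org.apache.spark.sql.functions.udf

  // Divides every element by the array's maximum, overwriting the input buffer.
  val optimizedNormalizeUdf = udf { (l: mutable.WrappedArray[Double]) =>
    val max = l.max
    l.indices.foreach(i => l.update(i, l(i) / max))
    l
  }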
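To make the difference between the two udf bodies concrete outside Spark, here is a minimal plain-Scala sketch. It uses a bare `Array[Double]` rather than the wrapped array Spark passes to a udf, and the names `NormalizeSketch`, `naive` and `inPlace` are made up for illustration.

```scala
object NormalizeSketch {
  // Allocates a new collection, like the udf in the question.
  def naive(l: Seq[Double]): Seq[Double] = {
    val max = l.max
    l.map(_ / max)
  }

  // Overwrites the input array and returns the same instance,
  // like the optimized udf.
  def inPlace(l: Array[Double]): Array[Double] = {
    val max = l.max
    var i = 0
    while (i < l.length) {
      l(i) = l(i) / max
      i += 1
    }
    l
  }

  def main(args: Array[String]): Unit = {
    println(naive(Seq(1.0, 2.0, 3.0, 4.0)).mkString(", "))
    println(inPlace(Array(1.0, 2.0, 3.0, 4.0)).mkString(", "))
  }
}
```

Both variants share the same edge cases: `max` throws on an empty array, and a maximum of 0 produces `NaN` or infinite values, so real data may need a guard.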

I've written a benchmark to check the performance of the solution proposed by user8838736. Here are the results.

[info] Benchmark                         Mode  Cnt    Score    Error  Units
[info] NormalizeBenchmark.builtin        avgt   10  140.293 ± 10.805  ms/op
[info] NormalizeBenchmark.udf_naive      avgt   10  104.708 ±  7.421  ms/op
[info] NormalizeBenchmark.udf_optimized  avgt   10   99.492 ±  7.829  ms/op
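The code behind the `builtin` row is not reproduced on this page. Presumably it relies on Spark's built-in column functions instead of a udf. As a rough sketch only, and not necessarily what was benchmarked, one way to express the normalization in Spark 2.2 for a fixed array length is to combine `greatest` and `array`. The object name `BuiltinNormalize`, the helper `normalize` and the parameter `n` are hypothetical, and `n` is assumed to be at least 2 because `greatest` needs at least two columns.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{array, col, greatest}

object BuiltinNormalize {
  // Assumption: every array in the column has exactly n elements, n >= 2.
  def normalize(values: Column, n: Int): Column = {
    // One column expression per array position.
    val elems = (0 until n).map(i => values.getItem(i))
    // Row-wise maximum across those positions.
    val max = greatest(elems: _*)
    // Rebuild the array with every element divided by the maximum.
    array(elems.map(_ / max): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("normalize-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(Seq(1.0, 2.0, 3.0, 4.0)).toDF("values")
    df.select(normalize(col("values"), 4).as("values")).show(false)

    spark.stop()
  }
}
```

This approach builds an expression whose size grows with `n`, which may be part of why a built-in version can end up slower than a udf for this workload.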

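Conclusion: in this case the udf is the most performant solution. Both udf variants beat the built-in version, and the in-place variant is only marginally faster than the naive one (the difference is within the reported error).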

PS: For those who are interested, the source code of the benchmark is here: https://github.com/YannMoisan/spark-jmh
