How to normalize an array column in a dataframe
Question
I'm using Spark 2.2 and I want to normalize each value in a fixed-size array.
Input
{"values": [1,2,3,4]}
Output
{"values": [0.25, 0.5, 0.75, 1] }
Currently, I'm using a udf:
import org.apache.spark.sql.functions.udf

// Divide every element by the array's maximum.
val f = udf { (l: Seq[Double]) =>
  val max = l.max
  l.map(_ / max)
}
Is there a way to avoid the udf (and the associated performance penalty)?
Recommended answer
I've come up with an optimized version of my udf, which performs in-place updates.
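import scala.collection.mutable

// Same normalization, but overwriting the input array instead of allocating a new one.
val optimizedNormalizeUdf = udf { (l: mutable.WrappedArray[Double]) =>
  val max = l.max
  // Overwrite each element with its normalized value.
  l.indices.foreach(i => l.update(i, l(i) / max))
  l
}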
I've written a benchmark to compare these against the solution proposed by user8838736. Here are the results.
[info] Benchmark                         Mode  Cnt    Score    Error  Units
[info] NormalizeBenchmark.builtin        avgt   10  140,293 ± 10,805  ms/op
[info] NormalizeBenchmark.udf_naive      avgt   10  104,708 ±  7,421  ms/op
[info] NormalizeBenchmark.udf_optimized  avgt   10   99,492 ±  7,829  ms/op
Conclusion: in this case, the udf is the most performant solution.
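For readers wondering what the `builtin` row could correspond to, here is a minimal sketch of one udf-free way to normalize a fixed-size array in Spark 2.2. It is not the code that was benchmarked, since that answer is not reproduced here. The object name `BuiltinNormalize`, the helper `normalize`, and the size `n = 4` are assumptions made for illustration. The approach extracts each element as a column, takes the row-wise maximum with `greatest`, and rebuilds the array with `array`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, greatest}

object BuiltinNormalize {
  // Hypothetical helper: normalizes the "values" column, assumed to hold arrays of exactly n doubles.
  def normalize(df: DataFrame, n: Int): DataFrame = {
    // One Column per array position: values[0], values[1], ...
    val elems = (0 until n).map(i => col("values")(i))
    // Row-wise maximum over the extracted elements (greatest needs at least two columns).
    val max = greatest(elems: _*)
    // Rebuild the array with every element divided by the maximum.
    df.withColumn("values", array(elems.map(_ / max): _*))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("normalize").getOrCreate()
    import spark.implicits._

    val df = Seq(Seq(1.0, 2.0, 3.0, 4.0)).toDF("values")
    normalize(df, n = 4).show(false)

    spark.stop()
  }
}
```

This only works when the array size is known up front, because the element columns are generated before the query runs. It also needs `n` to be at least 2, since `greatest` requires two or more columns.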
PS: For those who are interested, the source code of the benchmark is here: https://github.com/YannMoisan/spark-jmh