Spark Dataset aggregation similar to RDD aggregate(zero)(accum, combiner)
Problem description
RDD has a very useful method, aggregate, that allows accumulating with some zero value and combining the results across partitions. Is there any way to do that with Dataset[T]? As far as I can see from the Scala docs, there is actually nothing capable of doing that. Even the reduce method only allows binary operations with T as both arguments. Any reason why? And is there anything capable of doing the same?

Thanks a lot!

VK
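For reference, a minimal sketch of the RDD.aggregate pattern the question describes, assuming an existing SparkContext named sc (the variable names and sample data are illustrative, not from the original post):

```scala
// RDD.aggregate signature:
//   def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
val rdd = sc.parallelize(Seq(1, 2, 3, 4))

// zeroValue = (0, 0): the result type (sum, count) differs from the element type Int.
// seqOp folds one element into a partial (sum, count) within a partition;
// combOp merges the partial (sum, count) pairs across partitions.
val (sum, count) = rdd.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),
  (a, b) => (a._1 + b._1, a._2 + b._2)
)
val mean = sum.toDouble / count
```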
Answer

There are two different classes which can be used to achieve aggregate-like behavior in the Dataset API:
UserDefinedAggregateFunction, which uses SQL types and takes Columns as input. The initial value is defined using the initialize method, seqOp with the update method, and combOp with the merge method. Example implementation: How to define a custom aggregation function to sum a column of Vectors?
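A minimal sketch of such a UserDefinedAggregateFunction, here a simple sum over a Long column (the object name SumUDAF and the schema are illustrative assumptions, not taken from the linked answer):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

object SumUDAF extends UserDefinedAggregateFunction {
  // Input: a single Long column.
  def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
  // Aggregation buffer: the running sum.
  def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true

  // Plays the role of the zero value in RDD.aggregate.
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L

  // Plays the role of seqOp: folds one input row into the buffer.
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)

  // Plays the role of combOp: merges two partial buffers.
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)

  // Finalization: extracts the result from the buffer.
  def evaluate(buffer: Row): Any = buffer.getLong(0)
}
```

It can then be applied like any aggregate function, e.g. df.agg(SumUDAF(col("value"))), assuming a DataFrame df with a Long column named value.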
Aggregator, which uses standard Scala types with Encoders and takes records as input. The initial value is defined using the zero method, seqOp with the reduce method, and combOp with the merge method. Example implementation: How to find mean of grouped Vector columns in Spark SQL?
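A minimal sketch of such an Aggregator, again a simple sum over Long records (the object name SumAgg is an illustrative assumption):

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

object SumAgg extends Aggregator[Long, Long, Long] {
  // The zero value for the aggregation.
  def zero: Long = 0L
  // Plays the role of seqOp: folds one record into the intermediate value.
  def reduce(buffer: Long, record: Long): Long = buffer + record
  // Plays the role of combOp: merges intermediate values from different partitions.
  def merge(b1: Long, b2: Long): Long = b1 + b2
  // Finalization: transforms the intermediate value into the output.
  def finish(reduction: Long): Long = reduction
  // Encoders for the intermediate and output types.
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}
```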
Both provide an additional finalization method (evaluate and finish respectively), which is used to generate the final result, and both can be used for global as well as by-key aggregations.
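For illustration, a usage sketch with the SumAgg object defined above, assuming a local SparkSession (the sample data is made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(1L, 2L, 3L, 4L).toDS()

// Global aggregation: a single result for the whole Dataset.
ds.select(SumAgg.toColumn).show()

// By-key aggregation: one result per key (here, the parity of the value).
ds.groupByKey(_ % 2).agg(SumAgg.toColumn).show()
```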