Spark Dataset aggregation similar to RDD aggregate(zero)(accum, combiner)


Problem description

RDD has a very useful method, aggregate, that allows accumulating with some zero value and combining the results across partitions. Is there any way to do that with Dataset[T]? As far as I can see from the Scala doc, there is actually nothing capable of doing that. Even the reduce method only allows binary operations with T as both arguments. Any reason why? And is there anything capable of doing the same?
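For reference, here is a minimal sketch of the RDD method I mean (assuming an existing SparkContext sc and made-up values):

    val rdd = sc.parallelize(Seq(1, 2, 3, 4))

    // aggregate(zeroValue)(seqOp, combOp): fold within each partition, then combine
    val (sum, count) = rdd.aggregate((0, 0))(
      (acc, x) => (acc._1 + x, acc._2 + 1),   // seqOp: accumulate one element
      (a, b)   => (a._1 + b._1, a._2 + b._2)  // combOp: merge partial results
    )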

Thanks a lot!

VK

Recommended answer

There are two different classes which can be used to achieve aggregate-like behavior in the Dataset API:

  • UserDefinedAggregateFunction, which uses SQL types and takes Columns as input.

The initial value is defined with the initialize method, seqOp with the update method, and combOp with the merge method (a minimal sketch follows below the list).

Example implementation: How to define a custom aggregation function to sum a column of Vectors?

  • Aggregator, which uses standard Scala types with Encoders and takes records as input.

The initial value is defined with the zero method, seqOp with the reduce method, and combOp with the merge method (see the sketch after the finalization note below).

Example implementation: How to find mean of grouped Vector columns in Spark SQL?
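As a rough illustration of the first option, here is a minimal UserDefinedAggregateFunction that sums a Long column (the LongSum object is hypothetical and not taken from the linked answer; note that this class is deprecated as of Spark 3.0 in favor of Aggregator):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    object LongSum extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
      def dataType: DataType = LongType
      def deterministic: Boolean = true

      // initial (zero) value of the aggregation buffer
      def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L

      // seqOp: fold one input row into the buffer
      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)

      // combOp: merge two partial buffers
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)

      // finalization: produce the result from the buffer
      def evaluate(buffer: Row): Any = buffer.getLong(0)
    }

It is then applied through the untyped API, e.g. df.agg(LongSum(col("value"))).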

Both provide an additional finalization method (evaluate and finish respectively), which is used to generate the final result, and both can be used for global as well as by-key aggregations.
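And here is a minimal sketch of the second option, an Aggregator that computes the mean of a Double field (the Record and AvgBuffer case classes are made up for illustration and are not part of the linked answer):

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    case class Record(key: String, value: Double)     // illustrative input type
    case class AvgBuffer(sum: Double, count: Long)    // intermediate buffer

    object AvgValue extends Aggregator[Record, AvgBuffer, Double] {
      def zero: AvgBuffer = AvgBuffer(0.0, 0L)                      // initial value
      def reduce(b: AvgBuffer, r: Record): AvgBuffer =              // seqOp
        AvgBuffer(b.sum + r.value, b.count + 1)
      def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer =          // combOp
        AvgBuffer(b1.sum + b2.sum, b1.count + b2.count)
      def finish(b: AvgBuffer): Double =                            // finalization
        if (b.count == 0) 0.0 else b.sum / b.count

      def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }

    // Given some ds: Dataset[Record]:
    //   ds.select(AvgValue.toColumn)                  // global aggregation
    //   ds.groupByKey(_.key).agg(AvgValue.toColumn)   // by-key aggregation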

