Spark Dataset aggregation similar to RDD aggregate(zero)(accum, combiner)


Problem Description

RDD has a very useful method, aggregate, that allows accumulating with some zero value and combining that across partitions. Is there any way to do that with Dataset[T]? As far as I can see from the specification in the Scala docs, there is actually nothing capable of doing that. Even the reduce method only allows binary operations with T as both arguments. Any reason why? And is there anything capable of doing the same?

Thanks a lot!

VK

Solution

There are two different classes which can be used to achieve aggregate-like behavior in the Dataset API:

- UserDefinedAggregateFunction, which works with SQL types and is applied to Columns.
- Aggregator, which works with standard Scala types and their Encoders.

Both provide an additional finalization method (evaluate and finish, respectively), which is used to generate the final result and can be used for both global and by-key aggregations.
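By way of illustration, here is a minimal sketch of the Aggregator approach (assuming the typed Aggregator API available since Spark 2.0; the object name SumAgg and its summing logic are hypothetical stand-ins for whatever zero/seqOp/combOp you would pass to RDD.aggregate):

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// Mirrors RDD.aggregate(zero)(seqOp, combOp): `zero` is the initial value,
// `reduce` plays the role of seqOp within a partition, `merge` plays the
// role of combOp across partitions, and `finish` is the finalization step
// mentioned above.
object SumAgg extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buffer: Long, value: Long): Long = buffer + value
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(buffer: Long): Long = buffer
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ds = spark.range(1, 5).as[Long]

// Global aggregation.
ds.select(SumAgg.toColumn).show()

// By-key aggregation via groupByKey.
ds.groupByKey(_ % 2).agg(SumAgg.toColumn).show()

The UserDefinedAggregateFunction counterpart is Row-based and finalized via evaluate; the class name SumUdaf below is likewise a made-up example:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class SumUdaf extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L  // zero
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =         // seqOp
    buffer(0) = buffer.getLong(0) + input.getLong(0)
  def merge(b1: MutableAggregationBuffer, b2: Row): Unit =                 // combOp
    b1(0) = b1.getLong(0) + b2.getLong(0)
  def evaluate(buffer: Row): Any = buffer.getLong(0)                       // finalization
}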
