spark: What is the difference between Aggregator and UDAF?
Question
In Spark's documentation, Aggregator:
abstract class Aggregator[-IN, BUF, OUT] extends Serializable
A base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements of a group and reduce them to a single value.
UserDefinedAggregateFunction is:
abstract class UserDefinedAggregateFunction extends Serializable
The base class for implementing user-defined aggregate functions (UDAF).
According to Dataset Aggregator - Databricks, "an Aggregator is similar to a UDAF, but the interface is expressed in terms of JVM objects instead of as a Row."
These two classes seem very similar. What other differences are there, apart from the types in the interface?
A similar question: Performance of UDAF versus Aggregator in Spark
Recommended answer
A fundamental difference, apart from types, is the external interface:
- Aggregator takes a complete Row (it is intended for the "strongly" typed API).
- UserDefinedAggregateFunction takes a set of Columns.
This makes Aggregator less flexible, although the overall API is far more user friendly.
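The interface difference described above can be sketched at the call site. This is a minimal illustration, not code from the Spark documentation: the names Sale, SumAmount, SumColumn and InterfaceDemo are hypothetical, and it assumes spark-sql (2.x or 3.x) is on the classpath. Note that UserDefinedAggregateFunction is deprecated as of Spark 3.0.

```scala
import org.apache.spark.sql.{Encoder, Encoders, Row, SparkSession}
import org.apache.spark.sql.expressions.{Aggregator, MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, DoubleType, StructField, StructType}

// Hypothetical record type, used only for this illustration.
case class Sale(shop: String, amount: Double)

// Typed API: the Aggregator receives the whole Sale object.
object SumAmount extends Aggregator[Sale, Double, Double] {
  def zero: Double = 0.0
  def reduce(acc: Double, s: Sale): Double = acc + s.amount
  def merge(a: Double, b: Double): Double = a + b
  def finish(acc: Double): Double = acc
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Untyped API: the UDAF sees only the columns passed at the call site.
object SumColumn extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer.update(0, 0.0)
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer.update(0, buffer.getDouble(0) + input.getDouble(0))
  def merge(b1: MutableAggregationBuffer, b2: Row): Unit =
    b1.update(0, b1.getDouble(0) + b2.getDouble(0))
  def evaluate(buffer: Row): Any = buffer.getDouble(0)
}

object InterfaceDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("agg-vs-udaf").getOrCreate()
    import spark.implicits._
    val sales = Seq(Sale("a", 1.0), Sale("a", 2.0), Sale("b", 5.0)).toDS()

    // Aggregator: converted to a TypedColumn and applied to the typed Dataset.
    sales.groupByKey(_.shop).agg(SumAmount.toColumn).show()

    // UDAF: applied to explicitly chosen columns, so any Double column can be passed.
    sales.groupBy($"shop").agg(SumColumn($"amount")).show()

    spark.stop()
  }
}
```

SumColumn can be reused on any numeric column of any DataFrame, whereas SumAmount is tied to Dataset[Sale]; that is the flexibility trade-off mentioned above.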
There is also a difference in how state is handled:
- Aggregator is stateful. It depends on the mutable internal state of its buffer field.
- UserDefinedAggregateFunction is stateless. The state of the buffer is external.
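One possible way to picture the state difference, offered as a sketch rather than a description of Spark internals: in an Aggregator the buffer is an ordinary JVM object of type BUF that reduce and merge may mutate and return, while a UserDefinedAggregateFunction only ever receives a MutableAggregationBuffer that Spark creates and owns. The names AvgBuffer, MeanOfDoubles and StateDemo below are hypothetical, and the callbacks are driven by hand (no SparkSession) purely to show where the state lives.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical buffer type: a plain mutable JVM object.
case class AvgBuffer(var sum: Double, var count: Long)

object MeanOfDoubles extends Aggregator[Double, AvgBuffer, Double] {
  // A fresh buffer for every group or partition.
  def zero: AvgBuffer = AvgBuffer(0.0, 0L)

  // The buffer is mutated in place and returned.
  def reduce(b: AvgBuffer, x: Double): AvgBuffer = {
    b.sum += x
    b.count += 1
    b
  }

  // Partial buffers from different partitions are combined.
  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }

  def finish(b: AvgBuffer): Double =
    if (b.count == 0) Double.NaN else b.sum / b.count

  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object StateDemo {
  def main(args: Array[String]): Unit = {
    // Simulate two partitions by folding the callbacks manually.
    val part1 = Seq(1.0, 2.0).foldLeft(MeanOfDoubles.zero)(MeanOfDoubles.reduce)
    val part2 = Seq(6.0).foldLeft(MeanOfDoubles.zero)(MeanOfDoubles.reduce)
    println(MeanOfDoubles.finish(MeanOfDoubles.merge(part1, part2))) // prints 3.0
  }
}
```

In a UserDefinedAggregateFunction the equivalent fields would be declared through bufferSchema, and initialize, update and merge would read and write them through the MutableAggregationBuffer argument instead of through an object the function defines itself.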