spark: What is the difference between Aggregator and UDAF?


Problem Description

In Spark's documentation, Aggregator:


abstract class Aggregator[-IN, BUF, OUT] extends Serializable

A base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements of a group and reduce them to a single value.
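
For reference, here is a minimal sketch of what implementing this contract looks like, loosely following the averaging example in the Spark SQL documentation; the Employee and Average case classes and the MyAverage name are illustrative, not part of the question:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Illustrative input (IN) and buffer (BUF) types
case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Employee, Average, Double] {
  // The neutral element of the aggregation
  def zero: Average = Average(0L, 0L)
  // Fold one input object into the buffer and return the updated buffer
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }
  // Combine two intermediate buffers
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  // Transform the final buffer into the output (OUT) value
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  // Encoders for the intermediate and output types
  def bufferEncoder: Encoder[Average] = Encoders.product[Average]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}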

UserDefinedAggregateFunction is:


abstract class UserDefinedAggregateFunction extends Serializable

The base class for implementing user-defined aggregate functions (UDAF).
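
And a corresponding sketch of the untyped UDAF contract, again adapted from the standard averaging example in the Spark documentation; the MyAverageUDAF name is illustrative:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

object MyAverageUDAF extends UserDefinedAggregateFunction {
  // Data types of the input arguments of this aggregate function
  def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
  // Data types of the values in the aggregation buffer
  def bufferSchema: StructType =
    StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
  // The data type of the returned value
  def dataType: DataType = DoubleType
  // Whether this function always returns the same output for identical input
  def deterministic: Boolean = true
  // Initializes the externally supplied aggregation buffer
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }
  // Updates the buffer with a new input Row
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }
  // Merges two aggregation buffers
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Calculates the final result from the buffer
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}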

According to Dataset Aggregator - Databricks, "an Aggregator is similar to a UDAF, but the interface is expressed in terms of JVM objects instead of as a Row."

These two classes seem very similar. Apart from the types in the interface, what other differences are there?

A similar question is: Performance of UDAF versus Aggregator in Spark

Recommended Answer

A fundamental difference, apart from types, is the external interface:


  • Aggregator takes a complete Row (it is intended for the "strongly" typed API).
  • UserDefinedAggregateFunction takes a set of Columns.

This makes Aggregator less flexible, although the overall API is far more user friendly.
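
A short usage sketch of the two interfaces, assuming the illustrative MyAverage and MyAverageUDAF objects defined above, an active SparkSession named spark, and a hypothetical employees.json file with name and salary fields:

import org.apache.spark.sql.Dataset
import spark.implicits._

// Aggregator: turned into a TypedColumn with toColumn and used on a typed Dataset[Employee]
val ds: Dataset[Employee] = spark.read.json("employees.json").as[Employee]
val typedAvg = ds.select(MyAverage.toColumn.name("average_salary"))

// UDAF: registered by name and applied to Columns of an untyped DataFrame
spark.udf.register("myAverage", MyAverageUDAF)
val untypedAvg = spark.read.json("employees.json").selectExpr("myAverage(salary) AS average_salary")

The Aggregator works against the object type of the Dataset, while the UDAF only sees the Columns it is given, which is the interface difference described in the bullets above.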

Handling of state is also different:


  • Aggregator is stateful; it depends on the mutable internal state of its buffer field.
  • UserDefinedAggregateFunction is stateless; the state of the buffer is external.

