Spark: What is the difference between Aggregator and UDAF?


Question

In Spark's documentation, Aggregator is:

abstract class Aggregator[-IN, BUF, OUT] extends Serializable

A base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements of a group and reduce them to a single value.

UserDefinedAggregateFunction is:

abstract class UserDefinedAggregateFunction extends Serializable

The base class for implementing user-defined aggregate functions (UDAF).

According to Dataset Aggregator - Databricks, "an Aggregator is similar to a UDAF, but the interface is expressed in terms of JVM objects instead of as a Row."

These two classes seem very similar. What other differences are there, apart from the types in the interface?

A similar question is: Performance of UDAF versus Aggregator in Spark

Recommended answer

Apart from the types, a fundamental difference is the external interface:

  • Aggregator takes a complete Row, that is, the whole record (it is intended for the "strongly" typed API).
  • UserDefinedAggregateFunction takes a set of Columns.

This makes Aggregator less flexible, although the overall API is far more user-friendly.
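To make the "whole record" point concrete, here is a minimal sketch of an Aggregator used through the typed Dataset API, assuming Spark 2.0 or later running in local mode. The Sale case class, the TotalAmount object, and the sample data are hypothetical names invented for this illustration; they do not come from the question or the answer.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical record type, used only for this illustration.
case class Sale(shop: String, amount: Double)

// IN = Sale (the whole record), BUF = Double, OUT = Double.
object TotalAmount extends Aggregator[Sale, Double, Double] {
  // Initial value of the intermediate result.
  def zero: Double = 0.0
  // Fold one complete input record into the intermediate result.
  def reduce(buffer: Double, sale: Sale): Double = buffer + sale.amount
  // Combine two intermediate results, e.g. from different partitions.
  def merge(b1: Double, b2: Double): Double = b1 + b2
  // Turn the final intermediate result into the output value.
  def finish(reduction: Double): Double = reduction
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object AggregatorExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("aggregator-sketch")
      .getOrCreate()
    import spark.implicits._

    val sales = Seq(Sale("a", 1.0), Sale("a", 2.5), Sale("b", 4.0)).toDS()

    // The aggregator sees whole Sale objects; no individual columns are selected.
    val totals = sales
      .groupByKey(_.shop)
      .agg(TotalAmount.toColumn.name("total"))

    totals.show()
    spark.stop()
  }
}
```

Note that the caller cannot choose which field gets summed at the call site: that decision lives inside reduce, which is one way to read the "less flexible" remark above.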

There is also a difference in how state is handled:

  • Aggregator is stateful. It depends on the mutable internal state of its buffer field.
  • UserDefinedAggregateFunction is stateless. The state of the buffer is external.
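The UserDefinedAggregateFunction side of both lists can be sketched as follows: the buffer is handed to every callback from outside, and the function is applied to explicitly chosen columns. This is only an illustration under assumptions, not the answer author's code: the SumAmount class, the column names, and the sample data are hypothetical, and it assumes a Spark version that still ships UserDefinedAggregateFunction (it is deprecated as of Spark 3.0).

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, DoubleType, StructField, StructType}

// Hypothetical UDAF that sums a single Double column.
class SumAmount extends UserDefinedAggregateFunction {
  // Schema of the input columns the function is applied to.
  def inputSchema: StructType = StructType(StructField("amount", DoubleType) :: Nil)
  // Schema of the aggregation buffer, which Spark owns and passes in.
  def bufferSchema: StructType = StructType(StructField("total", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  // The buffer arrives as an argument; this object keeps no fields of its own.
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0.0
  }

  // input holds only the columns named at the call site, here one Double.
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getDouble(0) + input.getDouble(0)
    }
  }

  // Merge a second partial buffer into the first one.
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
  }

  def evaluate(buffer: Row): Any = buffer.getDouble(0)
}

object UdafExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("udaf-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1.0), ("a", 2.5), ("b", 4.0)).toDF("shop", "amount")
    val sumAmount = new SumAmount

    // The caller decides which column(s) to aggregate.
    df.groupBy($"shop").agg(sumAmount($"amount").as("total")).show()
    spark.stop()
  }
}
```

Because the column is picked at the call site, the same SumAmount instance could be applied to any other Double column of any DataFrame without changing the class.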
