Do I have to explicitly use Dataframe's methods to take advantage of Dataset's optimization?


Question

To take advantage of Dataset's optimizations, do I have to explicitly use DataFrame's methods (e.g. df.select(col("name"), col("age")), etc.), or does calling any of Dataset's methods, even the RDD-like ones (e.g. filter, map, etc.), also allow for optimization?

Answer

DataFrame optimization generally comes in three flavors:

  1. Tungsten memory management
  2. Catalyst query optimization
  3. Whole-stage code generation

Tungsten memory management

When defining an RDD[myclass], Spark has no real understanding of what myclass is. This means that, in general, each row will contain an instance of the class.

This has two problems.

The first is the size of the objects. A Java object has overhead. For example, consider a case class which contains two simple integers. Creating a sequence of 1,000,000 instances and turning it into an RDD takes ~26MB, while doing the same with a Dataset/DataFrame takes ~2MB.
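
Here is a rough spark-shell sketch of how you could observe this yourself (it assumes a spark-shell session, where a SparkSession named spark and its implicits are available; the exact sizes depend on the Spark version):

    // Cache the same data as an RDD and as a Dataset, then compare their
    // sizes under the "Storage" tab of the Spark UI.
    import spark.implicits._

    case class Point(x: Int, y: Int)
    val data = (1 to 1000000).map(i => Point(i, i))

    // RDD of Java objects: every row is a full Point instance on the JVM heap.
    val rdd = spark.sparkContext.parallelize(data)
    rdd.cache().count()

    // Dataset: rows are stored in Spark's compact Tungsten binary format.
    val ds = data.toDS()
    ds.cache().count()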

In addition, with a Dataset/DataFrame this memory is not managed by garbage collection (Spark manages it internally as unsafe memory), so there is less GC overhead.

Dataset enjoys the same memory management advantages as DataFrame. That said, when performing Dataset operations, converting the data from the internal (Row) representation to the case class carries a performance overhead.
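
To make the overhead concrete, compare a typed operation with an untyped one on the same Dataset (continuing the hypothetical spark-shell session above):

    // Typed operation: each internal row must be deserialized into a Point
    // instance so the lambda can run on it.
    val typed = ds.map(p => p.x + p.y)

    // Untyped column expression: works directly on the internal binary rows;
    // no Point objects are ever created.
    val untyped = ds.select(($"x" + $"y").as("sum"))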

Catalyst query optimization

When you use DataFrame functions, Spark knows what you are trying to do and can sometimes rewrite your query into an equivalent, more efficient one.

Say, for example, that you are doing something like: df.withColumn("a", lit(1)).filter($"b" < ($"a" + 1)).

Basically, you are checking whether (x < 1 + 1). Spark is smart enough to understand this and change it to x < 2.
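
You can watch Catalyst do this with explain(). A small sketch (the exact plan text varies by Spark version, but the optimized logical plan should show the folded predicate rather than the expression as written):

    import org.apache.spark.sql.functions.lit
    import spark.implicits._

    val df = Seq((0, 9), (5, 9)).toDF("b", "c")
    df.withColumn("a", lit(1))
      .filter($"b" < ($"a" + 1))
      .explain(true)  // optimized plan contains a filter on (b < 2)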

This kind of optimization cannot be done for Dataset operations, since Spark has no idea about the internals of the functions you pass in.
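
For contrast, here is the same predicate written as a Scala lambda (using a hypothetical Rec case class for illustration). To Catalyst the lambda is an opaque function, so it shows up in the plan as a typed filter that cannot be folded or pushed down:

    case class Rec(a: Int, b: Int)
    val dsRec = Seq(Rec(1, 0), Rec(1, 5)).toDS()

    // Spark must deserialize each row and call the function as a black box.
    dsRec.filter(r => r.b < r.a + 1).explain(true)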

Whole-stage code generation

When Spark knows what you are doing, it can actually generate more efficient code. This can improve performance by a factor of 10 in some cases.

This, too, cannot be done for Dataset functions, since Spark does not know their internals.
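
You can see which operators were fused by whole-stage code generation in the physical plan: they are prefixed with "*" (or "*(n)" in newer versions). A sketch, assuming Spark 3.x for the "codegen" explain mode:

    val df2 = spark.range(1000).selectExpr("id * 2 AS x").filter("x > 10")
    df2.explain()          // fused operators appear with a "*" prefix

    // Spark 3.x can also dump the generated Java code directly:
    df2.explain("codegen")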

