Do I have to explicitly use Dataframe's methods to take advantage of Dataset's optimization?


Question

To take advantage of Dataset's optimizations, do I have to explicitly use Dataframe's methods (e.g. df.select(col("name"), col("age")), etc.), or would calling any of Dataset's methods, even RDD-like ones such as filter and map, also allow for optimization?

Answer

Dataframe optimization comes, in general, in three flavors:

  1. Tungsten memory management
  2. Catalyst query optimization
  3. Whole-stage code generation

Tungsten memory management

When defining an RDD[myclass], Spark has no real understanding of what myclass is. This means that, in general, each row will contain an instance of the class.

This has two problems.

The first is the size of the object. A Java object carries overhead: for example, take a case class containing two simple integers. Turning a sequence of 1,000,000 instances into an RDD takes ~26MB, while the same data as a dataset/dataframe takes ~2MB.
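A minimal sketch of how one might reproduce that comparison (the Pair case class, the local master, and the app name are all hypothetical; the cached sizes are read off the Storage tab of the Spark UI):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    // Hypothetical case class with two simple integers, as in the example above.
    case class Pair(a: Int, b: Int)

    val spark = SparkSession.builder().master("local[*]").appName("size-demo").getOrCreate()
    import spark.implicits._

    val data = (1 to 1000000).map(i => Pair(i, i))

    // RDD: every row is a full JVM Pair object, each paying the object-header tax.
    val rdd = spark.sparkContext.parallelize(data).persist(StorageLevel.MEMORY_ONLY)
    rdd.count() // force caching; the size shows up in the Spark UI "Storage" tab

    // Dataset: rows are kept in Tungsten's compact binary format instead.
    val ds = data.toDS().persist(StorageLevel.MEMORY_ONLY)
    ds.count()  // compare the two cached sizes (roughly 26MB vs 2MB per the answer)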

In addition, this memory, when held in a dataset/dataframe, is not managed by garbage collection (Spark manages it internally as unsafe memory), so there is less GC overhead.

Dataset enjoys the same memory-management advantages as dataframe. That said, when performing typed dataset operations, converting the data from the internal (Row) data structure to the case class carries a performance overhead.
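A small sketch of that trade-off, reusing the hypothetical Pair dataset from above: the typed map has to materialize Pair instances for the lambda, while the equivalent column expression never leaves the internal row format:

    // Typed: each internal row is deserialized into a Pair so the lambda can
    // run, then the result is re-encoded; that round trip is the overhead.
    val typed = ds.map(p => p.a + p.b)

    // Untyped: a column expression Spark can evaluate directly against the
    // internal binary rows, with no case-class round trip.
    val untyped = ds.select(($"a" + $"b").as("sum"))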

Catalyst query optimization

When using dataframe functions, Spark knows what you are trying to do and can sometimes rewrite your query into an equivalent one that is more efficient.

For example, say you are doing something like: df.withColumn("a", lit(1)).filter($"b" < ($"a" + 1)).

Essentially, you are checking whether (x < 1 + 1). Spark is smart enough to understand this and rewrite it as x < 2.
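You can watch Catalyst apply that rewrite with explain (a sketch assuming a dataframe df that has an integer column b):

    import org.apache.spark.sql.functions.lit

    val query = df.withColumn("a", lit(1)).filter($"b" < ($"a" + 1))

    // Prints the parsed, analyzed, optimized, and physical plans; in the
    // optimized logical plan the predicate has been folded to (b < 2).
    query.explain(true)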

These kinds of rewrites cannot be done for typed dataset operations, since Spark has no insight into the internals of the functions you pass in.
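For contrast, the same predicate written as a typed filter is an opaque Scala closure (a sketch, again using the hypothetical Pair dataset):

    // The closure body is a black box to Catalyst: Spark cannot fold its
    // constants or push the predicate down to the data source.
    val typedFiltered = ds.filter(p => p.b < p.a + 1)
    typedFiltered.explain(true) // shows a plain filter over deserialized objects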

Whole-stage code generation

When Spark knows what you are doing, it can actually generate more efficient code. In some cases this can improve performance by a factor of 10.

This also cannot be done for typed dataset functions, since Spark does not know the internals of those functions.
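One way to see whether whole-stage code generation applies is to look at the physical plan: operators that were fused into a single generated function are prefixed with an asterisk. A sketch (the explain("codegen") variant is, to my knowledge, available from Spark 3.0 onward):

    // Operators shown as *(n) in the physical plan run inside a single
    // generated Java function for that stage.
    df.filter($"b" < 2).select(($"a" + 1).as("a1")).explain()

    // Dump the generated Java source for each codegen subtree (Spark 3.x).
    df.filter($"b" < 2).explain("codegen")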
