When to use Spark DataFrame/Dataset API and when to use plain RDD?


Problem Description

The Spark SQL DataFrame/Dataset execution engine has several extremely efficient time and space optimizations (e.g. InternalRow and expression codegen). According to much of the documentation, it seems to be a better option than RDDs for most distributed algorithms.
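For instance, the generated code is easy to inspect. A minimal sketch, assuming a local SparkSession; the debugCodegen() helper comes from Spark's org.apache.spark.sql.execution.debug package:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.debug._  // adds debugCodegen() to Dataset

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Print the Java source that Catalyst generates for this plan
    // (whole-stage codegen), including the InternalRow handling.
    spark.range(0, 100L).selectExpr("id + 1").debugCodegen()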

However, I did some source-code research and am still not convinced. I have no doubt that InternalRow is much more compact and can save a large amount of memory. But the execution of an algorithm may not be any faster, except for predefined expressions. Namely, the source code of org.apache.spark.sql.catalyst.expressions.ScalaUDF indicates that every user-defined function does three things:

  1. convert the Catalyst type (used in InternalRow) to the Scala type (used in GenericRow).
  2. apply the function.
  3. convert the result back from the Scala type to the Catalyst type.

Apparently this is even slower than just applying the function directly on an RDD without any conversion. Can anyone confirm or refute my speculation with some real-case profiling and code analysis?
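To make the comparison concrete, here is the kind of micro-benchmark I have in mind. A minimal sketch, assuming a local SparkSession; the timings it prints are only indicative and are no substitute for real profiling:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{sum, udf}

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    def time[A](label: String)(body: => A): A = {
      val t0 = System.nanoTime()
      val result = body
      println(f"$label: ${(System.nanoTime() - t0) / 1e9}%.2f s")
      result
    }

    val ds = spark.range(0, 10000000L)  // one column, "id"

    // DataFrame path: the closure is wrapped in ScalaUDF, so each row goes
    // through Catalyst -> Scala conversion, the function, then Scala -> Catalyst.
    val plusOne = udf((x: Long) => x + 1)
    time("udf")(ds.select(sum(plusOne($"id"))).first())

    // RDD path: the closure runs directly on JVM objects, with no row
    // conversion, but also without Catalyst/Tungsten optimizations.
    time("rdd")(ds.rdd.map(_.longValue + 1).reduce(_ + _))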

Thank you so much for any suggestions or insights.

Solution

From the Databricks blog article A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets:

When to use RDDs?

Consider these scenarios or common use cases for using RDDs when:

  • you want low-level transformation and actions and control on your dataset;
  • your data is unstructured, such as media streams or streams of text;
  • you want to manipulate your data with functional programming constructs rather than domain specific expressions;
  • you don’t care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column;
  • and you can forgo some optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.
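To make the contrast in those bullets concrete, here is a small sketch of the same word count written both ways; the input path logs.txt and the local SparkSession are assumptions for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{count, explode, lower, split}

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // RDD style: low-level functional transformations, no schema imposed.
    val rddCounts = spark.sparkContext
      .textFile("logs.txt")  // hypothetical input file
      .flatMap(_.toLowerCase.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    // DataFrame style: a schema (single "value" column) plus domain-specific
    // expressions that Catalyst can analyze and optimize.
    val dfCounts = spark.read.text("logs.txt")
      .select(explode(split(lower($"value"), "\\s+")).as("word"))
      .groupBy($"word")
      .agg(count("*").as("n"))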

In High Performance Spark's Chapter 3, DataFrames, Datasets, and Spark SQL, you can see some of the performance gains you can get with the DataFrame/Dataset API compared to RDDs.

And in the Databricks article mentioned above, you can also find that DataFrames optimize space usage compared to RDDs.
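One easy way to check the space claim yourself is to cache both representations of the same data and compare their sizes. A minimal sketch, assuming a local SparkSession:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    val ds = spark.range(0, 1000000L)

    // Cache the Dataset (compact Tungsten binary format)...
    ds.persist(StorageLevel.MEMORY_ONLY).count()

    // ...and the equivalent RDD of boxed JVM objects.
    ds.rdd.persist(StorageLevel.MEMORY_ONLY).count()

    // Compare the two entries under the "Storage" tab of the Spark UI
    // (http://localhost:4040 by default); the Dataset cache is typically
    // several times smaller.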
