What is the difference between Spark DataSet and RDD


Question

I'm still struggling to understand the full power of the recently introduced Spark Datasets.

Are there best practices for when to use RDDs and when to use Datasets?

In their announcement, Databricks explains that staggering reductions in both runtime and memory can be achieved by using Datasets. Still, it is claimed that Datasets are designed "to work alongside the existing RDD API".

Is this just a nod to backward compatibility, or are there scenarios where one would prefer to use RDDs over Datasets?

Answer

At this moment (Spark 1.6.0) the DataSet API is just a preview and only a small subset of features is implemented, so it is not possible to say anything about best practices.

Conceptually Spark DataSet is just a DataFrame with additional type safety (or, if you prefer a glance at the future, DataFrame is a DataSet[Row]; see https://github.com/apache/spark/blob/1ed354a5362967d904e9513e5a1618676c9c67a6/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L58). It means you get all the benefits of Catalyst and Tungsten (https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html). That includes logical and physical plan optimization, vectorized operations and low-level memory management.
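As a rough sketch of what that type safety buys you (assuming a local Spark 1.6 setup; the `Person` case class and the `sqlContext` name are just illustrations, not part of the answer):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Long)

object TypedVsUntyped {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ds-vs-df").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val people = Seq(Person("Ann", 32L), Person("Bob", 45L))

    val df = people.toDF()  // untyped: rows addressed by column name
    val ds = people.toDS()  // typed: Dataset[Person]

    // DataFrame: column names are plain strings, mistakes surface at runtime.
    df.filter(df("age") > 40).show()

    // Dataset: fields are checked by the Scala compiler, lambdas keep the static type.
    ds.filter(_.age > 40).map(_.name).collect().foreach(println)

    sc.stop()
  }
}
```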

What you lose is flexibility and transparency.

First of all, your data has to be encoded before it can be used with a DataSet. Spark provides encoders for primitive types and Products / case classes, and for now the API required to define custom serialization is not available. Most likely it will be relatively similar to the UDT API (see for example How to define schema for custom type in Spark SQL?, Serialize/Deserialize existing class for spark sql dataframe) with all its issues. It is relatively verbose, requires additional effort, and can become far from obvious with complex objects. Moreover, it touches some lower-level aspects of the API which are not very well documented.
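For illustration, a minimal sketch of that encoding requirement (Spark 1.6 preview API, with `sqlContext.implicits._` in scope as above; the `Event` and `Legacy` classes are hypothetical):

```scala
import sqlContext.implicits._

// Works: encoders are provided for primitives and Products / case classes.
case class Event(id: Long, tag: String)
val events = Seq(Event(1L, "a"), Event(2L, "b")).toDS()

// Does not work out of the box: an arbitrary class has no encoder, and as of
// 1.6 there is no public API to define a custom one.
class Legacy(val id: Long)
// Seq(new Legacy(1L)).toDS()   // would not compile: no implicit Encoder[Legacy] found
```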

Regarding transparency, it is pretty much the same problem as with a planner in a typical RDBMS. It is great until it isn't. It is an amazing tool that can analyze your data and make smart transformations, but like any tool it can take a wrong path and leave you staring at the execution plan, trying to figure out how to make things work.
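When it does take a wrong path, the main tool you have is the plan output itself; a small sketch, reusing the hypothetical `df` and `ds` from above:

```scala
// Print the parsed, analyzed, optimized and physical plans for an expression-based query.
df.filter(df("age") > 40).explain(true)

// A typed Dataset filter compiles to an opaque Scala closure, so the optimizer
// sees much less of it than it sees of the Column expression above.
ds.filter(_.age > 40).toDF().explain(true)
```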

Based on the preview I would say it can be placed somewhere between the DataFrame API and the RDD API. It is more flexible than DataFrames but still provides similar optimizations and is well suited for general data processing tasks. It doesn't provide the same flexibility as the RDD API (at least not without a deeper dive into Catalyst internals).
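A small sketch of that middle ground, continuing the hypothetical `ds: Dataset[Person]` from above: typed lambdas feel like the RDD API while execution still goes through Catalyst/Tungsten, and `.rdd` remains the escape hatch when you need full RDD flexibility.

```scala
// Typed, RDD-like transformations that still run on the optimized engine.
val adults = ds.filter(_.age > 40)

// Escape hatch: drop to a plain RDD[Person] and use anything the RDD API allows.
val names = ds.rdd.mapPartitions(_.map(_.name))
```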

Another difference, which at this moment is just hypothetical, is the way it interacts with guest languages (R, Python). Similar to DataFrame, DataSet belongs to the JVM. It means that any possible interaction belongs to one of two categories: native JVM operations (like DataFrame expressions) or guest-side code (like a Python UDF). Unfortunately, the second category requires an expensive round-trip between the JVM and the guest environment.
