Difference between DataFrame, Dataset, and RDD in Spark
Question
I'm just wondering: what is the difference between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]) in Apache Spark?
Can you convert one to the other?
Recommended answer
A DataFrame is defined well by a Google search for "DataFrame definition":
A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.
So, a DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.
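To make the "additional metadata" point concrete, here is a minimal sketch in Scala. The sample data, column names, application name, and local master setting are assumptions made for illustration and are not part of the original answer. It prints the schema Spark keeps for a DataFrame and asks Spark to show the plans it derives for a simple query.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameSchemaSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are arbitrary choices.
    val spark = SparkSession.builder()
      .appName("dataframe-schema-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: (name, age) pairs turned into a two-column DataFrame.
    val people = Seq(("Alice", 34), ("Bob", 19), ("Carol", 52)).toDF("name", "age")

    // The schema (column names and types) is the extra metadata a DataFrame carries.
    people.printSchema()

    // The filter and projection are expressed against named columns, so Spark
    // can analyze the whole query before executing anything.
    val adults = people.filter($"age" >= 21).select("name")

    // Prints the parsed, analyzed, optimized, and physical plans for the query.
    adults.explain(true)

    spark.stop()
  }
}
```

The plans printed by explain are where any optimizations become visible; the exact plan text varies between Spark versions.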
An RDD, on the other hand, is merely a Resilient Distributed Dataset. It is more of a black box of data that cannot be optimized, because the operations that can be performed against it are not as constrained.
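As a rough illustration of the "black box" point, the sketch below runs a filter and a projection over an RDD of tuples. The data and names are assumed for the example. The transformations are ordinary Scala functions, which Spark executes as written without inspecting their bodies.

```scala
import org.apache.spark.sql.SparkSession

object RddOpaqueSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are arbitrary choices.
    val spark = SparkSession.builder()
      .appName("rdd-opaque-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical sample data: an RDD of (name, age) tuples with no column names.
    val people = spark.sparkContext.parallelize(
      Seq(("Alice", 34), ("Bob", 19), ("Carol", 52)))

    // Both steps are arbitrary Scala closures. Spark ships them to the executors
    // and runs them as-is; it has no column-level view of what they do.
    val adultNames = people
      .filter { case (_, age) => age >= 21 }
      .map { case (name, _) => name }

    adultNames.collect().foreach(println)

    // The lineage shows the chain of RDDs, not a query plan over columns.
    println(adultNames.toDebugString)

    spark.stop()
  }
}
```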
However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
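The two conversions can be sketched as follows. The tuple data and the column names "name" and "age" are assumptions for the example. Note that toDF becomes available on an RDD of tuples or case classes only after importing spark.implicits._, and that the rdd method returns an RDD of generic Row objects.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object ConversionSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are arbitrary choices.
    val spark = SparkSession.builder()
      .appName("conversion-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // brings toDF into scope for RDDs of tuples and case classes

    // Hypothetical tabular RDD: every element has the same (String, Int) shape.
    val pairs: RDD[(String, Int)] =
      spark.sparkContext.parallelize(Seq(("Alice", 34), ("Bob", 19)))

    // RDD -> DataFrame: supply column names for the tuple fields.
    val df: DataFrame = pairs.toDF("name", "age")
    df.printSchema()

    // DataFrame -> RDD: the result is an RDD of untyped Row objects.
    val rows: RDD[Row] = df.rdd

    // Fields are read back from each Row by name (or by position).
    val names: RDD[String] = rows.map(row => row.getAs[String]("name"))
    names.collect().foreach(println)

    spark.stop()
  }
}
```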
In general, it is recommended to use a DataFrame where possible because of the built-in query optimization.