Difference between DataFrame, Dataset, and RDD in Spark
Question
I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]) in Apache Spark?
Can you convert one to the other?
Recommended answer
A DataFrame is defined well by a Google search for "DataFrame definition":
A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.
So, a DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.
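As a rough illustration of that metadata, the sketch below builds a small DataFrame, prints its schema, and asks Spark to print the plans it derived for a simple query. The sample data, the column names, the object name `DataFrameSchemaSketch`, and the helper `namesOlderThan` are all assumptions made up for this example, and a local Spark 2.x or later session is assumed.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of the original answer.
object DataFrameSchemaSketch {

  // Returns the names of people older than minAge, printing the schema
  // and the query plans along the way.
  def namesOlderThan(spark: SparkSession, minAge: Int): Seq[String] = {
    import spark.implicits._ // enables toDF, the $"col" syntax, and as[String]

    // Made-up sample data: (name, age) pairs.
    val people = Seq(("Alice", 34), ("Bob", 45), ("Carol", 29)).toDF("name", "age")

    // The schema (column names and types) is the extra metadata.
    people.printSchema()

    // The filter and projection refer to named columns, so Spark can
    // inspect them; explain(true) prints the logical and physical plans.
    val query = people.filter($"age" > minAge).select("name")
    query.explain(true)

    query.as[String].collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the app name is arbitrary.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataframe-schema-sketch")
      .getOrCreate()
    println(namesOlderThan(spark, 30))
    spark.stop()
  }
}
```

The exact plan text depends on the Spark version, so no output is shown here.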
An RDD, on the other hand, is merely a Resilient Distributed Dataset: more of a black box of data that cannot be optimized, because the operations that can be performed against it are not as constrained.
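To make the "black box" point concrete, here is a minimal sketch of a filter-and-project step written against an RDD. The sample data, the object name `RddOpaqueSketch`, and the helper `namesOlderThan` are hypothetical. The filter and map steps are ordinary Scala closures, which Spark runs as written without knowing which fields they read.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of the original answer.
object RddOpaqueSketch {

  // Returns the names of people older than minAge using only RDD operations.
  def namesOlderThan(spark: SparkSession, minAge: Int): Seq[String] = {
    // Made-up sample data: an RDD of (name, age) tuples with no named columns.
    val people = spark.sparkContext.parallelize(
      Seq(("Alice", 34), ("Bob", 45), ("Carol", 29)))

    // Arbitrary closures: Spark executes them on each element but has no
    // schema-level view of what they do, so it cannot rewrite the query.
    people
      .filter { case (_, age) => age > minAge }
      .map { case (name, _) => name }
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the app name is arbitrary.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-opaque-sketch")
      .getOrCreate()
    println(namesOlderThan(spark, 30))
    spark.stop()
  }
}
```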
However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
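The round trip can be sketched as follows, assuming an RDD of tuples as the "tabular" input. The object name `RddDataFrameConversionSketch`, the helper `roundTrip`, the sample data, and the column names are made up for illustration. In Scala, `toDF` becomes available on such an RDD once `spark.implicits._` is imported, and `rdd` on a DataFrame yields an `RDD[Row]`.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Hypothetical example object; not part of the original answer.
object RddDataFrameConversionSketch {

  // Converts an RDD of tuples to a DataFrame and back to an RDD of Rows.
  def roundTrip(spark: SparkSession): (DataFrame, RDD[Row]) = {
    import spark.implicits._ // brings toDF into scope for RDDs of tuples

    // Made-up tabular data: every element has the same (String, Int) shape.
    val pairs: RDD[(String, Int)] =
      spark.sparkContext.parallelize(Seq(("Alice", 34), ("Bob", 45)))

    // RDD -> DataFrame: toDF attaches column names to the tuple fields.
    val df: DataFrame = pairs.toDF("name", "age")

    // DataFrame -> RDD: the rdd method exposes the rows as an RDD[Row].
    val rows: RDD[Row] = df.rdd

    (df, rows)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the app name is arbitrary.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-dataframe-conversion-sketch")
      .getOrCreate()
    val (df, rows) = roundTrip(spark)
    df.show()
    rows.collect().foreach(println)
    spark.stop()
  }
}
```

Note that the RDD coming back from a DataFrame holds generic Row objects rather than the original tuples, so fields are read by position or name instead of by tuple accessor.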
In general, it is recommended to use a DataFrame where possible because of the built-in query optimization.