What's the difference between RDD and DataFrame in Spark?
Question
Hi, I am relatively new to Apache Spark. I want to understand the difference between RDDs, DataFrames and Datasets.
For example, I am pulling data from an S3 bucket.
df=spark.read.parquet("s3://output/unattributedunattributed*")
In this case, when I load data from S3, what would the RDD be? Also, since an RDD is immutable and I can change values in df, df couldn't be an RDD.
I would appreciate it if someone could explain the difference between RDDs, DataFrames and Datasets.
Recommended answer
df=spark.read.parquet("s3://output/unattributedunattributed*")
With this statement, you are creating a DataFrame.
To create an RDD, use textFile on the SparkContext (it reads the files as lines of text):
rdd = spark.sparkContext.textFile("s3://output/unattributedunattributed*")
RDD stands for Resilient Distributed Dataset. It is a read-only, partitioned collection of records. The RDD is the fundamental data structure of Spark, and it allows a programmer to perform in-memory computations.
In a DataFrame, data is organized into named columns, like a table in a relational database. It is also an immutable distributed collection of data. A DataFrame lets developers impose a structure onto a distributed collection of data, which allows a higher-level abstraction.
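- If you want to apply a map or filter to the whole dataset, use an RDD.
- If you want to work on individual columns or perform operations/calculations on a column, use a DataFrame.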
For example, if you want to replace 'A' with 'B' throughout the data, an RDD is useful.
rdd = rdd.map(lambda x: x.replace('A', 'B'))
If you want to change the data type of a column, use a DataFrame.
from pyspark.sql.functions import col
dff = dff.withColumn("LastmodifiedTime_timestamp", col('LastmodifiedTime_time').cast('timestamp'))
An RDD can be converted into a DataFrame and vice versa.