Spark RDD vs Dataframe - Data storage
Question
I am new to Spark and am learning about the Dataframe, its operations, and its architecture. While reading comparisons between RDD and Dataframe, I got confused about the data structures of both. Below are my observations; please clarify or correct them if they are wrong.
1) An RDD is stored in RAM in a distributed manner (as blocks) across the nodes of a cluster, if the source data is in a cluster (e.g. HDFS).
If the data source is just a single CSV file, the data will be distributed across multiple blocks in the RAM of the machine running Spark (e.g. a laptop). Am I right?
2) Is there any relationship between a block and a partition? Which one is the superset?
3) Dataframe: Is a Dataframe stored in the same way as an RDD? Will an RDD be created in the backend if I load my source data into a Dataframe alone?
Thanks in advance :)
Recommended answer
An RDD is stored in RAM in a distributed manner (as blocks) across the nodes of a cluster, if the source data is in a cluster (e.g. HDFS).
If caching or checkpointing is enabled, it may also be stored either in memory or on disk. Also, shuffling always involves disk writes.
If the data source is just a single CSV file, the data will be distributed across multiple blocks in the RAM of the machine running Spark (e.g. a laptop). Am I right?
The CSV file will be split into multiple partitions, and each task will read only its own chunk of the data (a start-end byte offset range).
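The offset-based reading described above can be illustrated without Spark. The following is a minimal pure-Python sketch, not Spark's actual implementation: the helper names plan_splits and read_split, the 32-byte split size, and the sample rows are all made up for illustration. It assumes the common convention that a reader which does not start at offset 0 discards its first line, and that every reader finishes the line that crosses its end offset, so each row is read exactly once.

```python
import os
import tempfile


def plan_splits(file_size, split_size):
    """Cut [0, file_size) into consecutive (start, end) byte ranges."""
    return [(start, min(start + split_size, file_size))
            for start in range(0, file_size, split_size)]


def read_split(path, start, end):
    """Return the lines owned by one split (hypothetical helper).

    A split that does not begin at offset 0 discards its first line,
    because the previous split reads through the line crossing its end.
    """
    lines = []
    with open(path, "rb") as f:
        f.seek(start)
        if start != 0:
            f.readline()  # owned by the previous split
        while f.tell() <= end:
            line = f.readline()
            if not line:  # end of file
                break
            lines.append(line.decode("utf-8").rstrip("\n"))
    return lines


# Build a small sample CSV file (made-up data).
rows = [f"{i},name{i}" for i in range(20)]
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(("\n".join(rows) + "\n").encode("utf-8"))
path = tmp.name

size = os.path.getsize(path)
splits = plan_splits(size, 32)  # 32-byte splits, chosen only for the demo
partitions = [read_split(path, s, e) for s, e in splits]
recovered = [line for part in partitions for line in part]
os.remove(path)
```

Each element of partitions plays the role of one task's chunk. Concatenating them gives back the original rows in order, even though the byte ranges cut through the middle of lines.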
Is there any relationship between a block and a partition? Which one is the superset?
It is a bit confusing. Take a look at this answer, which states that a split is a logical division of the input data while a block is a physical division of data. Spark uses its own terminology, and a partition in Spark has roughly the same meaning as a split in Hadoop.
When a file is read from HDFS, HadoopRDD is used and, under the hood, each split becomes a partition.
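To make the logical/physical distinction concrete, here is a small arithmetic sketch in plain Python. The sizes are invented for the example (a 300 MB file, 128 MB blocks, 100 MB splits), and blocks_for_split is a hypothetical helper, not a Hadoop or Spark function. It shows which physical blocks each logical split touches, assuming splits are plain consecutive byte ranges.

```python
def blocks_for_split(start, end, block_size):
    """Indices of the physical blocks overlapped by the byte range [start, end)."""
    first = start // block_size
    last = (end - 1) // block_size
    return list(range(first, last + 1))


# Illustrative sizes in MB (assumptions, not defaults of any system).
file_size = 300
block_size = 128   # physical division: blocks 0, 1, 2
split_size = 100   # logical division: three splits

splits = [(s, min(s + split_size, file_size))
          for s in range(0, file_size, split_size)]
overlaps = [blocks_for_split(s, e, block_size) for s, e in splits]

for (s, e), blocks in zip(splits, overlaps):
    print(f"split [{s}, {e}) MB -> blocks {blocks}")
```

With these numbers the first split sits inside block 0, while the second and third each straddle two blocks. So in this sketch neither concept is a strict superset of the other: a split can be smaller than a block or cross a block boundary, depending on the chosen split size.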
Dataframe: Is a Dataframe stored in the same way as an RDD? Will an RDD be created in the backend if I load my source data into a Dataframe alone?
A Dataframe is nothing more than an RDD[InternalRow] under the hood. Take a look at SparkPlan.