Spark RDD vs DataFrame: Data storage

Question

I am new to Spark and am learning about DataFrames, their operations, and the architecture. While reading comparisons of RDD and DataFrame, I got confused about how each one stores its data. Below are my observations; please clarify or correct them if they are wrong.

1) An RDD is stored in RAM in a distributed manner (as blocks) across the nodes of a cluster, if the source data is in a cluster (e.g. HDFS).

If the data source is just a single CSV file, the data will be distributed across multiple blocks in the RAM of the machine running Spark (a laptop, for example). Am I right?

2) Is there any relationship between a block and a partition? Which one is the superset?

3) DataFrame: Is a DataFrame stored in the same way as an RDD? Will an RDD be created in the backend if I load my source data only into a DataFrame?

Thanks in advance :)

Recommended answer

An RDD is stored in RAM in a distributed manner (as blocks) across the nodes of a cluster, if the source data is in a cluster (e.g. HDFS).

If caching or checkpointing is enabled, it might also be stored in memory or on disk. Also, shuffling always involves disk writes.

If the data source is just a single CSV file, the data will be distributed across multiple blocks in the RAM of the machine running Spark (a laptop, for example). Am I right?

The CSV file will be split into multiple partitions, and each task will read only its own chunk of the data (a start-end offset range).
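To make the start-end offset idea concrete, here is a minimal sketch in plain Python (no Spark involved). It assumes the usual line-oriented convention: a line belongs to the range in which it starts, a reader whose range begins mid-file discards its first (possibly partial) line, and every reader finishes a line that crosses its end offset. The names plan_ranges and read_range are hypothetical helpers written for this illustration, not Spark or Hadoop APIs.

```python
import os
import tempfile


def plan_ranges(file_size, num_partitions):
    """Cut [0, file_size) into contiguous (start, end) byte ranges."""
    if file_size <= 0:
        return []
    chunk = max(1, -(-file_size // num_partitions))  # ceiling division
    return [(s, min(s + chunk, file_size)) for s in range(0, file_size, chunk)]


def read_range(path, start, end):
    """Return the lines owned by one byte range, as one task would."""
    lines = []
    with open(path, "rb") as f:
        f.seek(start)
        if start != 0:
            # Discard up to the next newline: that text is read by
            # whichever earlier range the line started in.
            f.readline()
        # Keep reading while the next line starts at or before `end`,
        # which may run past `end` to finish the last line.
        while f.tell() <= end:
            line = f.readline()
            if not line:
                break  # end of file
            lines.append(line.rstrip(b"\r\n").decode("utf-8"))
    return lines


if __name__ == "__main__":
    rows = ["id,name"] + [f"{i},user{i}" for i in range(10)]
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
        tmp.write("\n".join(rows) + "\n")
        path = tmp.name
    try:
        for start, end in plan_ranges(os.path.getsize(path), 3):
            print((start, end), read_range(path, start, end))
    finally:
        os.remove(path)
```

Each range is read independently of the others, yet every line ends up in exactly one range, which is what lets separate tasks process one file in parallel.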

Is there any relationship between a block and a partition? Which one is the superset?

This is a bit confusing. Take a look at this answer, which states that a split is a logical division of the input data while a block is a physical division of the data. Spark uses its own terminology, and a partition in Spark has roughly the same meaning as a split in Hadoop.

When a file is read from HDFS, HadoopRDD is used and, under the hood, each split becomes a partition.
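The relationship between blocks and splits can be sketched with a simplified model of how Hadoop's FileInputFormat is commonly described as sizing splits. This is an approximation under stated assumptions, not the actual implementation: the function compute_splits is hypothetical, and the formula and the 1.1 slop factor reflect my understanding of the classic mapred input format; real behavior depends on the Hadoop version, the input format, and configuration.

```python
MIB = 1024 * 1024
SPLIT_SLOP = 1.1  # assumed: a tail up to 10% over split size is not split off


def compute_splits(file_size, block_size, min_partitions=2, min_split_size=1):
    """Return (offset, length) pairs for one non-empty file (simplified model)."""
    goal_size = file_size // max(min_partitions, 1)
    # The split size is capped by the block size and by the requested
    # parallelism, with a configurable lower bound.
    split_size = max(min_split_size, min(goal_size, block_size))
    splits = []
    offset, remaining = 0, file_size
    while remaining / split_size > SPLIT_SLOP:
        splits.append((offset, split_size))
        offset += split_size
        remaining -= split_size
    if remaining > 0:
        splits.append((offset, remaining))  # the tail becomes the last split
    return splits


if __name__ == "__main__":
    # 300 MiB file, 128 MiB blocks: splits line up with the 3 blocks.
    print(len(compute_splits(300 * MIB, 128 * MIB, min_partitions=2)))
    # Asking for more partitions makes splits smaller than a block.
    print(len(compute_splits(300 * MIB, 128 * MIB, min_partitions=6)))
    # 140 MiB file: a single split that spans two blocks.
    print(compute_splits(140 * MIB, 128 * MIB, min_partitions=1))
```

In this model a split can be smaller than a block, equal to one, or slightly larger than one, so neither is strictly a superset of the other: blocks describe where the bytes physically live, and splits (partitions in Spark) describe how the work is divided.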

DataFrame: Is a DataFrame stored in the same way as an RDD? Will an RDD be created in the backend if I load my source data only into a DataFrame?

A DataFrame is nothing more than an RDD[InternalRow] under the hood.
Take a look at SparkPlan.
