How Spark loads the data into memory


Problem description

I am totally confused about the Spark execution process. I have referred to many articles and tutorials, but none of them discusses this in detail. I might be understanding Spark wrongly. Please correct me.

I have a 40GB file distributed across 4 nodes (10GB on each node) of a 10-node cluster. When I say spark.read.textFile("test.txt") in my code, will it load the data (40GB) from all 4 nodes into the driver program (master node)? Or will this RDD be loaded on each of the 4 nodes separately? In that case, should each node's RDD hold 10GB of physical data? And does the whole RDD hold that 10GB of data and perform a task for each partition, i.e. 128MB in Spark 2.0, and finally shuffle the output to the driver program (master node)?

And I read somewhere that "number of cores in cluster = no. of partitions". Does that mean Spark will move the partitions of one node to all 10 nodes for processing?

Recommended answer

Spark doesn't have to read the whole file into memory at once. That 40GB file is split into many 128MB (or whatever your partition size is) partitions. Each of those partitions is a processing task. Each core will only work on one task at a time, with a preference for tasks whose data partition is stored on the same node. Only the 128MB partition that is being worked on needs to be read; the rest of the file is not read. Once a task completes (and produces some output), the 128MB for the next task can be read in, and the data read in for the first task can be freed from memory. Because of this, only the small amount of data being processed at a time needs to be loaded into memory, not the entire file at once.
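
A minimal sketch of how you could observe this split yourself, assuming a Spark 2.x SparkSession and the asker's "test.txt"; the object name, the local[*] master, and the printed labels are illustrative and not part of the original answer:

import org.apache.spark.sql.SparkSession

object PartitionInspection {
  def main(args: Array[String]): Unit = {
    // Illustrative local session; on a real cluster you would reuse your existing SparkSession.
    val spark = SparkSession.builder()
      .appName("partition-inspection")
      .master("local[*]")
      .getOrCreate()

    // The asker's file; for a 40GB file split into ~128MB chunks you would expect
    // roughly 40 * 1024 / 128 = 320 partitions, i.e. 320 tasks.
    val lines = spark.read.textFile("test.txt")
    println(s"Number of partitions (tasks): ${lines.rdd.getNumPartitions}")

    // Total cores available to the application; this bounds how many of those
    // tasks can run at the same time, not how many partitions exist.
    println(s"Default parallelism (cores): ${spark.sparkContext.defaultParallelism}")

    spark.stop()
  }
}

Each task reads only its own partition, so the peak memory needed per executor is roughly (concurrent tasks per executor) x (partition size), not the size of the whole file.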

Also, strictly speaking, spark.read.textFile("test.txt") does nothing on its own. It reads no data and does no processing. It creates an RDD, but an RDD doesn't contain any data; an RDD is just an execution plan. spark.read.textFile("test.txt") declares that the file test.txt will be read and used as a source of data if and when the RDD is evaluated, but by itself it does nothing.
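
A small sketch of that laziness, under the same assumptions as above (local session, the asker's test.txt); the ERROR filter is a made-up example transformation:

import org.apache.spark.sql.SparkSession

object LazyEvaluation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-evaluation")
      .master("local[*]")   // illustrative local run
      .getOrCreate()

    // No data is read here: this only records test.txt as a source in the plan.
    val lines = spark.read.textFile("test.txt")

    // Still no read: a transformation just extends the execution plan.
    val errors = lines.filter(_.contains("ERROR"))   // hypothetical filter

    // Only an action forces Spark to schedule tasks, and each task then reads
    // just the partition it is responsible for.
    println(s"Lines containing ERROR: ${errors.count()}")

    spark.stop()
  }
}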
