Spark RDDs - how do they work


Question

I have a small Scala program that runs fine on a single node. However, I am scaling it out so it runs on multiple nodes. This is my first such attempt. I am just trying to understand how RDDs work in Spark, so this question is based on theory and may not be 100% correct.

Let's say I create an RDD: val rdd = sc.textFile(file)

Now, once I've done that, does that mean that the file at file is now partitioned across the nodes (assuming all nodes have access to the file path)?
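
As an illustration (the path below is hypothetical, and sc is the usual SparkContext), you can ask the RDD how many partitions Spark created for the file; each partition is processed by a task that the scheduler places on some executor:

val rdd = sc.textFile("hdfs:///data/input.txt")  // hypothetical path
println(s"partitions: ${rdd.partitions.length}") // how many splits Spark created for this file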

Secondly, I want to count the number of objects in the RDD (simple enough); however, I need to use that number in a calculation that is applied to the objects in the RDD. A pseudocode example:

rdd.map(x => x / rdd.size)

Let's say there are 100 objects in rdd and 10 nodes, so 10 objects per node (assuming this is how the RDD concept works). Now, when I call the method, is each node going to perform the calculation with rdd.size as 10 or 100? Overall the RDD has size 100, but locally on each node it is only 10. Am I required to make a broadcast variable before doing the calculation? This question is linked to the one below.
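
A minimal sketch of the usual pattern, assuming the RDD holds numbers (with sc.textFile the elements would be Strings): run count() first, then use the result inside the transformation. The value is a plain Long captured in the closure and shipped to every task, so each node computes with the global size (100), not its local share (10). A broadcast variable is only worth the trouble for large values.

val nums = rdd.map(_.trim.toDouble)   // hypothetical: parse each line as a number
val total = nums.count()              // action: the driver gets the global count (100 here)
val scaled = nums.map(x => x / total) // total is captured in the closure, so every node divides by 100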

Finally, if I make a transformation to the RDD, e.g. rdd.map(_.split("-")), and then I want the new size of the RDD, do I need to perform an action on the RDD, such as count(), so that all the information is sent back to the driver node?
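
A sketch of that last step: map is lazy and only builds a new RDD; count() is the action that actually runs the job and returns the number to the driver. Note that map is one-to-one, so this count equals the original one; a flatMap(_.split("-")) would count the individual pieces instead.

val parts = rdd.map(_.split("-")) // transformation: lazy, nothing is executed yet
val newSize = parts.count()       // action: triggers the job, result is sent back to the driver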

Answer

Usually, the file (or parts of the file, if it is too big) is replicated to N nodes in the cluster (by default N=3 on HDFS). There is no intention to split every file across all available nodes.

However, for you (i.e. the client), working with the file through Spark should be transparent: you should not see any difference in rdd.size, no matter how many nodes it is split across and/or replicated to. There are methods (at least in Hadoop) to find out which nodes the file (or parts of it) is located on at the moment, but in simple cases you most probably won't need this functionality.
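
For completeness, a hedged sketch of such an inspection from the Spark side (the path is hypothetical): each partition exposes its preferred locations, which for HDFS-backed files are typically the datanodes holding that block. In simple jobs you never need to look at this.

val rdd = sc.textFile("hdfs:///data/input.txt") // hypothetical path
rdd.partitions.foreach { p =>
  println(s"partition ${p.index}: ${rdd.preferredLocations(p).mkString(", ")}")
}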

UPDATE: an article describing RDD internals: https://cs.stanford.edu/~matei/papers/2012/nsdi_spark.pdf

