Spark RDDs - how do they work


Question

I have a small Scala program that runs fine on a single node. However, I am scaling it out so it runs on multiple nodes. This is my first such attempt. I am just trying to understand how RDDs work in Spark, so this question is based on theory and may not be 100% correct.

Let's say I create an RDD: val rdd = sc.textFile(file)

Now, once I've done that, does that mean that the file at file is now partitioned across the nodes (assuming all nodes have access to the file path)?
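
As an illustration (the path below is hypothetical, and sc is the usual SparkContext), you can ask the RDD how many partitions Spark created for the file; each partition is processed by a task that the scheduler places on some executor:

val rdd = sc.textFile("hdfs:///data/input.txt")  // hypothetical path
println(s"partitions: ${rdd.partitions.length}") // how many splits Spark created for this file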

Secondly, I want to count the number of objects in the RDD (simple enough); however, I need to use that number in a calculation that is applied to the objects in the RDD. A pseudocode example:

rdd.map(x => x / rdd.size)

Let's say there are 100 objects in rdd and 10 nodes, so 10 objects per node (assuming this is how the RDD concept works). Now, when I call the method, is each node going to perform the calculation with rdd.size as 10 or 100? Overall the RDD has size 100, but locally on each node it is only 10. Am I required to make a broadcast variable before doing the calculation? This question is linked to the one below.
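
A minimal sketch of the usual pattern, assuming the RDD holds numbers (with sc.textFile the elements would be Strings): run count() first, then use the result inside the transformation. The value is a plain Long captured in the closure and shipped to every task, so each node computes with the global size (100), not its local share (10). A broadcast variable is only worth the trouble for large values.

val nums = rdd.map(_.trim.toDouble)   // hypothetical: parse each line as a number
val total = nums.count()              // action: the driver gets the global count (100 here)
val scaled = nums.map(x => x / total) // total is captured in the closure, so every node divides by 100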

Finally, if I make a transformation to the RDD, e.g. rdd.map(_.split("-")), and then I want the new size of the RDD, do I need to perform an action on the RDD, such as count(), so that all the information is sent back to the driver node?
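
A sketch of that last step: map is lazy and only builds a new RDD; count() is the action that actually runs the job and returns the number to the driver. Note that map is one-to-one, so this count equals the original one; a flatMap(_.split("-")) would count the individual pieces instead.

val parts = rdd.map(_.split("-")) // transformation: lazy, nothing is executed yet
val newSize = parts.count()       // action: triggers the job, result is sent back to the driver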

Answer

Usually, the file (or parts of the file, if it is too big) is replicated to N nodes in the cluster (by default N=3 on HDFS). There is no intention to split every file across all available nodes.

However, for you (i.e. the client), working with the file through Spark should be transparent: you should not see any difference in rdd.size, no matter how many nodes it is split across and/or replicated to. There are methods (at least in Hadoop) to find out which nodes the file (or parts of it) is located on at the moment, but in simple cases you most probably won't need this functionality.
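
For completeness, a hedged sketch of such an inspection from the Spark side (the path is hypothetical): each partition exposes its preferred locations, which for HDFS-backed files are typically the datanodes holding that block. In simple jobs you never need to look at this.

val rdd = sc.textFile("hdfs:///data/input.txt") // hypothetical path
rdd.partitions.foreach { p =>
  println(s"partition ${p.index}: ${rdd.preferredLocations(p).mkString(", ")}")
}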

UPDATE: an article describing RDD internals: https://cs.stanford.edu/~matei/papers/2012/nsdi_spark.pdf

