How to control preferred locations of RDD partitions?

Question

Is there a way to set the preferred locations of RDD partitions manually? I want to make sure that a certain partition is computed on a certain machine.

I'm using an array and the parallelize method to create an RDD from it.

Also, I'm not using HDFS; the files are on the local disk. That's why I want to control the execution node.

Recommended answer

Is there a way to set the preferredLocations of RDD partitions manually?

Yes, there is, but it's RDD-specific and so different kinds of RDDs have different ways to do it.

Spark uses RDD.preferredLocations to get a list of preferred locations to compute each partition/split on (e.g. block locations for an HDFS file).

final def preferredLocations(split: Partition): Seq[String]

Get the preferred locations of a partition, taking into account whether the RDD is checkpointed.

As you can see, the method is final, which means that no subclass can override it.

When you look at the source code of RDD.preferredLocations, you will see how an RDD knows its preferred locations. It uses the protected RDD.getPreferredLocations method, which a custom RDD may (but doesn't have to) override to specify placement preferences.

protected def getPreferredLocations(split: Partition): Seq[String] = Nil
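To make the override concrete, here is a minimal sketch of a custom RDD that reports its own placement preferences. The names PinnedPartition and PinnedRDD are hypothetical and not part of Spark; the sketch assumes one partition per element and hostnames that match the hosts your executors run on. Note that these are preferences for the scheduler, not guarantees.

```scala
import scala.reflect.ClassTag
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: a single value plus the hosts it prefers to run on.
final case class PinnedPartition[T](index: Int, value: T, hosts: Seq[String])
  extends Partition

// Hypothetical custom RDD (not a Spark class): one partition per (value, hosts) pair.
class PinnedRDD[T: ClassTag](sc: SparkContext, data: Seq[(T, Seq[String])])
  extends RDD[T](sc, Nil) {

  // Called on the driver to enumerate the partitions.
  override protected def getPartitions: Array[Partition] =
    data.zipWithIndex.map { case ((value, hosts), i) =>
      PinnedPartition(i, value, hosts): Partition
    }.toArray

  // Called on an executor to produce the contents of one partition.
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    Iterator.single(split.asInstanceOf[PinnedPartition[T]].value)

  // The hook that the final RDD.preferredLocations delegates to.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    split.asInstanceOf[PinnedPartition[T]].hosts
}
```

With an existing SparkContext sc, something like new PinnedRDD(sc, Seq(("a", Seq("host1")), ("b", Seq("host2")))) would yield two partitions whose preferredLocations are the given host lists.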

So the question has now "morphed" into another one: which RDDs allow their preferred locations to be set? Find yours and look at its source code.

I'm using an array and the parallelize method to create an RDD from it.

If you parallelize a local dataset, the data was never distributed to begin with: it all lives on one machine. That can be fine, but why would you want to use Spark for something you can process locally on a single computer/node?

If, however, you insist and really do want to use Spark for local datasets, the RDD behind SparkContext.parallelize is (let's have a look at the source code) ParallelCollectionRDD, which does allow for location preferences.
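As a quick check of what parallelize gives you by default, the sketch below prints the concrete RDD class and the preferred locations of each partition. The object name ParallelizeLocations, the app name, and the local[2] master are assumptions chosen for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeLocations {
  // Preferred hosts per partition for an RDD created with plain parallelize.
  def locations(sc: SparkContext): Seq[Seq[String]] = {
    val rdd = sc.parallelize(Array(1, 2, 3, 4).toSeq, numSlices = 2)
    println(rdd.getClass.getSimpleName) // the concrete RDD class behind parallelize
    rdd.partitions.toSeq.map(p => rdd.preferredLocations(p))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just to keep the sketch self-contained.
    val conf = new SparkConf().setAppName("parallelize-locations").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try locations(sc).zipWithIndex.foreach { case (hosts, i) =>
      println(s"partition $i prefers: $hosts")
    } finally sc.stop()
  }
}
```

Because parallelize passes no location preferences, each partition should report an empty list here.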

Let's then rephrase your question to the following (hoping I won't lose any important fact):

What are the operators that allow for creating a ParallelCollectionRDD and specifying the location preferences explicitly?

To my great surprise (as I didn't know about the feature), there is such an operator, i.e. SparkContext.makeRDD, that...accepts one or more location preferences (hostnames of Spark nodes) for each object.

makeRDD[T](seq: Seq[(T, Seq[String])]): RDD[T]

Distribute a local Scala collection to form an RDD, with one or more location preferences (hostnames of Spark nodes) for each object. Create a new partition for each collection item.

In other words, rather than using parallelize, you have to use makeRDD (which is available in the Spark Core API for Scala; I'm not sure about Python, which I'm leaving as a home exercise for you :))
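A minimal sketch of the makeRDD approach follows. The hostnames host1, host2, and host3, the object name MakeRddLocations, and the local[2] master are placeholders; on a real cluster the hostnames would need to match the hosts your executors actually run on. Keep in mind that preferred locations are scheduling hints: Spark tries to honor them but may run a task elsewhere once the locality wait expires.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MakeRddLocations {
  // Builds an RDD with one partition per item and returns
  // (partition index, preferred hosts) for each partition.
  def describe(sc: SparkContext,
               data: Seq[(String, Seq[String])]): Seq[(Int, Seq[String])] = {
    // The explicit type argument selects the overload that takes location preferences.
    val rdd = sc.makeRDD[String](data)
    rdd.partitions.toSeq.map(p => (p.index, rdd.preferredLocations(p)))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just to keep the sketch self-contained.
    val conf = new SparkConf().setAppName("makeRDD-locations").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val data = Seq(
        ("chunk-a", Seq("host1")),          // placeholder hostnames
        ("chunk-b", Seq("host2", "host3"))
      )
      describe(sc, data).foreach { case (i, hosts) =>
        println(s"partition $i prefers: ${hosts.mkString(", ")}")
      }
    } finally sc.stop()
  }
}
```

Since each collection item becomes its own partition, it can help to group the data per target machine first (for example, one array per host) and pass each group as a single item.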

The same reasoning applies to any other RDD operator or transformation that creates some sort of RDD.
