Spark map is only one task while it should be parallel (PySpark)


Problem Description


I have an RDD with around 7M entries, each with 10 normalized coordinates. I also have a number of centers, and I'm trying to map every entry to the closest center (by Euclidean distance). The problem is that this only generates one task, which means it is not running in parallel. This is the form:

def doSomething(point, centers):
    for center in centers.value:
        if distance(point, center) < 1:
            return center
    return None

preppedData.map(lambda x: doSomething(x, centers)).take(5)


The preppedData RDD is cached and already evaluated; the doSomething function is much simpler here than the real one, but the principle is the same. centers is a list that has been broadcast. Why does this map run in only one task?


Similar pieces of code in other projects map to roughly 100 tasks and run on all the executors; this one is 1 task on 1 executor. My job has 8 executors available, with 8 GB and 2 cores per executor.
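The question's distance() helper isn't shown; a minimal plain-Python sketch of the per-point logic follows (the Euclidean distance and the threshold of 1 come from the snippet above; the example centers are made up):

```python
import math

def distance(point, center):
    # Euclidean distance between two equal-length coordinate vectors
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))

def do_something(point, centers):
    # Return the first center within distance 1 of the point, else None
    # (same shape as the doSomething in the question, minus the broadcast)
    for center in centers:
        if distance(point, center) < 1:
            return center
    return None

centers = [(0.0, 0.0), (5.0, 5.0)]
print(do_something((0.5, 0.5), centers))  # → (0.0, 0.0)
print(do_something((9.0, 9.0), centers))  # → None
```

In the real job this function is applied per element via rdd.map, so each element is independent and there is no data-dependency reason for it to serialize into one task.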

Answer

This could be due to the conservative nature of the take() method. See the code in RDD.scala: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1201


What it does is first take the first partition of your RDD (if your RDD doesn't require a shuffle, this takes only one task), and if that one partition holds enough results, it returns them. If there is not enough data in that partition, it grows the number of partitions it tries until it gets the required number of elements.
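A rough plain-Python simulation of that scale-up behavior (the real logic lives in the RDD.scala link above; the growth factor of 4 here is illustrative, not necessarily Spark's exact constant):

```python
def simulated_take(partitions, num):
    # Sketch of the take() strategy described above: scan one partition
    # first, then grow the number of partitions scanned each round until
    # `num` elements have been collected.
    results = []
    scanned = 0
    parts_to_try = 1
    while len(results) < num and scanned < len(partitions):
        for part in partitions[scanned:scanned + parts_to_try]:
            results.extend(part)  # one "task" per partition scanned
        scanned += parts_to_try
        parts_to_try *= 4  # grow aggressively if the last pass fell short
    return results[:num], scanned

# 8 partitions, 10 rows each: take(5) is satisfied by the first partition alone
partitions = [list(range(i * 10, (i + 1) * 10)) for i in range(8)]
taken, tasks = simulated_take(partitions, 5)
print(taken, tasks)  # → [0, 1, 2, 3, 4] 1
```

With 10 rows in the first simulated partition, take(5) never looks beyond it, which matches the single-task behavior observed in the question.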


Since your RDD is already cached and your operation is only a map function, as long as the first partition has more than 5 rows, this will only ever require one task. More tasks would be unnecessary.


This code exists to avoid overloading the driver with too much data by fetching from all partitions at once for a small take.
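So to actually see the map run in parallel, trigger an action that must visit every partition (for example count(), or saving the result) instead of take(5). A toy plain-Python illustration of why such an action launches one task per partition (partition contents are made up):

```python
# Toy model of an RDD's data split into 8 partitions of 10 rows each.
partitions = [list(range(i * 10, (i + 1) * 10)) for i in range(8)]

def simulated_count(partitions):
    # count() needs the size of every partition, so Spark schedules one
    # task per partition; with 8 executors those tasks run concurrently.
    tasks = 0
    total = 0
    for part in partitions:
        tasks += 1          # one task per partition
        total += len(part)  # each task counts its own partition
    return total, tasks

total, tasks = simulated_count(partitions)
print(total, tasks)  # → 80 8
```

In the question's job that would be, e.g., preppedData.map(lambda x: doSomething(x, centers)).count(), after which a take(5) on the (cached) mapped RDD is cheap.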
