Spark map is only one task while it should be parallel (PySpark)


Problem Description


I have an RDD with around 7M entries, each with 10 normalized coordinates. I also have a number of centers, and I'm trying to map every entry to the closest center (by Euclidean distance). The problem is that this only generates one task, which means it is not running in parallel. This is the form:

def doSomething(point, centers):
    for center in centers.value:
        if distance(point, center) < 1:
            return center
    return None

preppedData.map(lambda x: doSomething(x, centers)).take(5)


The preppedData RDD is cached and already evaluated; the doSomething function is much simpler here than the real one, but the principle is the same. centers is a list that has been broadcast. Why does this map run in only one task?


Similar pieces of code in other projects map to roughly 100 tasks and run on all the executors; this one is 1 task on 1 executor. My job has 8 executors available, with 8 GB and 2 cores per executor.
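The question's distance() helper isn't shown; a minimal plain-Python sketch of the per-point logic follows (the Euclidean distance and the threshold of 1 come from the snippet above; the example centers are made up):

```python
import math

def distance(point, center):
    # Euclidean distance between two equal-length coordinate vectors
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))

def do_something(point, centers):
    # Return the first center within distance 1 of the point, else None
    # (same shape as the doSomething in the question, minus the broadcast)
    for center in centers:
        if distance(point, center) < 1:
            return center
    return None

centers = [(0.0, 0.0), (5.0, 5.0)]
print(do_something((0.5, 0.5), centers))  # → (0.0, 0.0)
print(do_something((9.0, 9.0), centers))  # → None
```

In the real job this function is applied per element via rdd.map, so each element is independent and there is no data-dependency reason for it to serialize into one task.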

Answer

This could be due to the conservative nature of the take() method. See the code in RDD.scala: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1201


What it does is first take the first partition of your RDD (if your RDD doesn't require a shuffle, this takes only one task), and if that one partition holds enough results, it returns them. If there is not enough data in that partition, it grows the number of partitions it tries until it gets the required number of elements.
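A rough plain-Python simulation of that scale-up behavior (the real logic lives in the RDD.scala link above; the growth factor of 4 here is illustrative, not necessarily Spark's exact constant):

```python
def simulated_take(partitions, num):
    # Sketch of the take() strategy described above: scan one partition
    # first, then grow the number of partitions scanned each round until
    # `num` elements have been collected.
    results = []
    scanned = 0
    parts_to_try = 1
    while len(results) < num and scanned < len(partitions):
        for part in partitions[scanned:scanned + parts_to_try]:
            results.extend(part)  # one "task" per partition scanned
        scanned += parts_to_try
        parts_to_try *= 4  # grow aggressively if the last pass fell short
    return results[:num], scanned

# 8 partitions, 10 rows each: take(5) is satisfied by the first partition alone
partitions = [list(range(i * 10, (i + 1) * 10)) for i in range(8)]
taken, tasks = simulated_take(partitions, 5)
print(taken, tasks)  # → [0, 1, 2, 3, 4] 1
```

With 10 rows in the first simulated partition, take(5) never looks beyond it, which matches the single-task behavior observed in the question.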


Since your RDD is already cached and your operation is only a map function, as long as the first partition has more than 5 rows, this will only ever require one task. More tasks would be unnecessary.


This code exists to avoid overloading the driver with too much data by fetching from all partitions at once for a small take.
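So to actually see the map run in parallel, trigger an action that must visit every partition (for example count(), or saving the result) instead of take(5). A toy plain-Python illustration of why such an action launches one task per partition (partition contents are made up):

```python
# Toy model of an RDD's data split into 8 partitions of 10 rows each.
partitions = [list(range(i * 10, (i + 1) * 10)) for i in range(8)]

def simulated_count(partitions):
    # count() needs the size of every partition, so Spark schedules one
    # task per partition; with 8 executors those tasks run concurrently.
    tasks = 0
    total = 0
    for part in partitions:
        tasks += 1          # one task per partition
        total += len(part)  # each task counts its own partition
    return total, tasks

total, tasks = simulated_count(partitions)
print(total, tasks)  # → 80 8
```

In the question's job that would be, e.g., preppedData.map(lambda x: doSomething(x, centers)).count(), after which a take(5) on the (cached) mapped RDD is cheap.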
