Why is Apache Spark's take function not parallel?

Question
Reading the Apache Spark programming guide at http://spark.apache.org/docs/latest/programming-guide.html, it states that take is not executed in parallel.

Why is the take function not run in parallel? What are the difficulties in implementing this type of function in parallel? Does it have something to do with the fact that, in order to take the first n elements of an RDD, it is necessary to traverse the entire RDD?
Answer

Actually, while take is not entirely parallel, it's not entirely sequential either.

For example, let's say you take(200), and each partition has 10 elements. take will first fetch partition 0 and see that it has 10 elements. It estimates that it would need 20 such partitions to get 200 elements, but it's better to ask for a bit more in a single parallel request. So it wants 30 partitions, and it already has 1. It therefore fetches partitions 1 to 29 next, in parallel. This will likely be the last step. If it's very unlucky and does not find a total of 200 elements, it will again make an estimate and request another batch in parallel.
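The batched strategy described above can be sketched in plain Python. This is a hypothetical simulation, not Spark's actual Scala code: partitions are modeled as plain lists, and the 1.5x over-request margin and 4x growth factor mirror the logic in RDD.scala, but constants and names here are illustrative assumptions.

```python
def take(partitions, num):
    """Simulate the batched fetch strategy of RDD.take (illustrative only)."""
    buf = []
    parts_scanned = 0
    total = len(partitions)
    while len(buf) < num and parts_scanned < total:
        if parts_scanned == 0:
            # First pass: fetch a single partition and see how full it is.
            num_parts_to_try = 1
        elif len(buf) == 0:
            # Scanned partitions were all empty: grow the batch aggressively.
            num_parts_to_try = parts_scanned * 4
        else:
            # Estimate how many partitions are still needed, with a 1.5x
            # safety margin, minus the ones already scanned.
            estimate = int(1.5 * num * parts_scanned / len(buf)) - parts_scanned
            num_parts_to_try = max(estimate, 1)
        # In Spark, this batch is processed as one parallel job.
        batch = partitions[parts_scanned:parts_scanned + num_parts_to_try]
        for p in batch:
            buf.extend(p[:num - len(buf)])
        parts_scanned += len(batch)
    return buf
```

With 10 elements per partition and take(200), the first pass fetches partition 0, and the second pass computes an estimate of 30 partitions total, fetching partitions 1 to 29 in one batch, exactly the scenario in the answer above.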
Check out the code; it's well documented: https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1049
I think the documentation is wrong. Local calculation only happens when a single partition is required. This is the case in the first pass (fetching partition 0), but it is typically not the case in later passes.