Mind blown: RDD.zip() method
Question
I just discovered the [RDD.zip()](http://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#zip%28org.apache.spark.rdd.RDD,%20scala.reflect.ClassTag%29) method and I cannot imagine what its contract could possibly be.
I understand what it does, of course. However, it has always been my understanding that
- the order of elements in an RDD is a meaningless concept
- the number of partitions and their sizes is an implementation detail only available to the user for performance tuning
In other words, an RDD is a (multi)set, not a sequence (and, of course, in, e.g., Python one gets `AttributeError: 'set' object has no attribute 'zip'`).
What is wrong with my understanding above?
What was the rationale behind this method?
Is it legal outside the trivial context like `a.map(f).zip(a)`?
Edit 1:
- Another crazy method is [zipWithIndex()](http://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#zipWithIndex%28%29), as well as the various [zipPartitions()](http://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#zipPartitions%28org.apache.spark.rdd.RDD,%20boolean,%20scala.Function2,%20scala.reflect.ClassTag,%20scala.reflect.ClassTag%29) variants.
- Note that [first()](http://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#first%28%29) and [take()](http://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#take%28int%29) are not crazy, because they are just (non-random) samples of the RDD.
- `collect()` is also okay: it just converts a set to a sequence, which is perfectly legit.
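As background, `zipWithIndex()` can stay well-defined on a distributed collection because indices can be assigned within each partition and then offset by the sizes of all earlier partitions. A minimal plain-Python sketch of that scheme (partitions modeled as nested lists; this is an illustration, not Spark's actual implementation):

```python
# Sketch of per-partition indexing: number elements inside each
# "partition" locally, then shift each partition's indices by the
# total element count of all preceding partitions.
from itertools import accumulate

def zip_with_index(partitions):
    sizes = [len(p) for p in partitions]
    # Starting global index of each partition: 0, then running totals.
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [[(x, off + i) for i, x in enumerate(p)]
            for p, off in zip(partitions, offsets)]

parts = [["a", "b"], ["c"], ["d", "e"]]  # three "partitions"
print([t for p in zip_with_index(parts) for t in p])
# [('a', 0), ('b', 1), ('c', 2), ('d', 3), ('e', 4)]
```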
Edit 2: a reply says:
> when you compute one RDD from another the order of elements in the new RDD may not correspond to that in the old one.
This appears to imply that even the trivial `a.map(f).zip(a)` is not guaranteed to be equivalent to `a.map(x => (f(x), x))`. In what situations are `zip()` results reproducible?
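To restate the equivalence in question with plain Python lists: the two expressions coincide exactly when `f` is deterministic and the element order of `a` is stable between evaluations, which is precisely what Spark does not promise for an arbitrary RDD (a sketch, not a Spark guarantee):

```python
# Plain-list analogue of a.map(f).zip(a) vs. a.map(x => (f(x), x)).
# Assumes f is deterministic and the order of `a` is stable between
# evaluations -- the very assumptions an RDD may break.
def f(x):
    return x * 10

a = [3, 1, 4, 1, 5]

zipped = list(zip(map(f, a), a))  # a.map(f).zip(a)
mapped = [(f(x), x) for x in a]   # a.map(x => (f(x), x))

assert zipped == mapped           # holds only under those assumptions
print(zipped)  # [(30, 3), (10, 1), (40, 4), (10, 1), (50, 5)]
```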
Answer
It is not true that RDDs are always unordered. An RDD has a guaranteed order if it is the result of a `sortBy` operation, for example. An RDD is not a set; it can contain duplicates. Partitioning is not opaque to the caller, and can be controlled and queried. Many operations do preserve both partitioning and order, like `map`. That said, I find it a little easy to accidentally violate the assumptions that `zip` depends on, since they're a little subtle, but it certainly has a purpose.
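The subtle assumptions here are that both RDDs have the same number of partitions and the same number of elements in each partition. A minimal plain-Python sketch of that per-partition contract (partitions modeled as nested lists, no Spark needed; error messages are illustrative):

```python
# Sketch of RDD.zip()'s per-partition contract: each inner list stands
# in for one partition, and zipping proceeds partition by partition.
def zip_partitioned(left, right):
    # Zipping fails outright if the partition counts differ...
    if len(left) != len(right):
        raise ValueError("can't zip collections with unequal numbers of partitions")
    result = []
    for lp, rp in zip(left, right):
        # ...or if any pair of corresponding partitions differs in size.
        if len(lp) != len(rp):
            raise ValueError("can only zip partitions of equal size")
        result.append(list(zip(lp, rp)))
    return result

a = [[1, 2], [3, 4]]                       # two "partitions"
squared = [[x * x for x in p] for p in a]  # like a.map(lambda x: x * x)
pairs = zip_partitioned(squared, a)        # like a.map(f).zip(a)
print([t for p in pairs for t in p])
# [(1, 1), (4, 2), (9, 3), (16, 4)]
```

Because a `map` keeps both the partitioning and the per-partition element counts of its input, `a.map(f).zip(a)` always satisfies this contract; two unrelated RDDs may not.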