Mind blown: RDD.zip() method
Question
I just discovered the RDD.zip() method and I cannot imagine what its contract could possibly be.
I understand what it does, of course. However, it has always been my understanding that
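- the order of elements in an RDD is a meaningless concept
- the number of partitions and their sizes are an implementation detail, exposed to the user only for performance tuning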
In other words, an RDD is a (multi)set, not a sequence (and, of course, in Python, for example, one gets AttributeError: 'set' object has no attribute 'zip').
What is wrong with my understanding above?
What was the rationale behind this method?
Is it legal outside a trivial context like a.map(f).zip(a)?
Edit 1:
- Another crazy method is zipWithIndex(), as well as the various zipPartitions() variants.
- Note that first() and take() are not crazy because they are just (non-random) samples of the RDD.
- collect() is also okay: it just converts a set to a sequence, which is perfectly legit.
Edit 2: A reply says:
"When you compute one RDD from another the order of elements in the new RDD may not correspond to that in the old one."
This appears to imply that even the trivial a.map(f).zip(a) is not guaranteed to be equivalent to a.map(x => (f(x),x)). Under what conditions are zip() results reproducible?
Recommended answer
It is not true that RDDs are always unordered. An RDD has a guaranteed order if it is the result of a sortBy operation, for example. An RDD is not a set; it can contain duplicates. Partitioning is not opaque to the caller, and can be controlled and queried. Many operations, like map, do preserve both partitioning and order. That said, I find it a little easy to accidentally violate the assumptions that zip depends on, since they're a little subtle, but it certainly has a purpose.
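
To make those assumptions concrete: the Spark API documentation describes zip as assuming that both RDDs have the same number of partitions and the same number of elements in each partition (for example, one was made through a map on the other). The following is only a toy model in plain Python, not Spark code. It represents an RDD as a list of partitions, and the helper names (map_partitions, filter_partitions, zip_partitions) and the ValueError are hypothetical, chosen for illustration.

```python
from typing import Any, Callable, List

# Toy stand-in for an RDD: an ordered list of partitions,
# each of which is an ordered list of elements. Not Spark's API.
Partitions = List[List[Any]]


def map_partitions(f: Callable[[Any], Any], parts: Partitions) -> Partitions:
    """Element-wise map: keeps partition count, sizes, and in-partition order."""
    return [[f(x) for x in part] for part in parts]


def filter_partitions(pred: Callable[[Any], bool], parts: Partitions) -> Partitions:
    """Filter keeps the partition count but may shrink individual partitions."""
    return [[x for x in part if pred(x)] for part in parts]


def zip_partitions(left: Partitions, right: Partitions) -> Partitions:
    """Pair elements position by position, partition by partition.

    Raises ValueError (an assumption of this sketch) when the two sides
    do not line up partition for partition and element for element.
    """
    if len(left) != len(right):
        raise ValueError("different number of partitions")
    zipped = []
    for l_part, r_part in zip(left, right):
        if len(l_part) != len(r_part):
            raise ValueError("partitions have different sizes")
        zipped.append(list(zip(l_part, r_part)))
    return zipped


a = [[1, 2, 3], [4, 5]]  # two partitions of unequal size


def f(x):
    return x * 10


# In this model, a.map(f).zip(a) agrees with a.map(x => (f(x), x)).
assert zip_partitions(map_partitions(f, a), a) == map_partitions(lambda x: (f(x), x), a)

# A filter changes partition sizes, so zipping the result back with `a` fails.
try:
    zip_partitions(filter_partitions(lambda x: x % 2 == 1, a), a)
except ValueError as err:
    print("zip rejected:", err)
```

In this model, the map case is safe because the layout of partitions is carried over unchanged. Any step that drops, adds, or moves elements between partitions (a filter, a repartition, a shuffle) breaks the alignment that zip relies on.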