Mind blown: RDD.zip() method

Question

I just discovered the RDD.zip() method and I cannot imagine what its contract could possibly be.

I understand what it does, of course. However, it has always been my understanding that

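  • the order of elements in an RDD is a meaningless concept
  • the number of partitions and their sizes are an implementation detail, exposed to the user only for performance tuning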

In other words, an RDD is a (multi)set, not a sequence (and, of course, in Python, for example, one gets AttributeError: 'set' object has no attribute 'zip').

What is wrong with my understanding above?

What was the rationale behind this method?

Is it legal outside a trivial context like a.map(f).zip(a)?

Edit 1:

  • Another crazy method is zipWithIndex(), as well as the various zipPartitions() variants.
  • Note that first() and take() are not crazy because they are just (non-random) samples of the RDD.
  • collect() is also okay - it just converts a set to a sequence which is perfectly legit.

Edit 2: The reply says:

when you compute one RDD from another the order of elements in the new RDD may not correspond to that in the old one.

This appears to imply that even the trivial a.map(f).zip(a) is not guaranteed to be equivalent to a.map(x => (f(x),x)). Under what conditions are zip() results reproducible?

Recommended answer

It is not true that RDDs are always unordered. An RDD has a guaranteed order if it is the result of a sortBy operation, for example. An RDD is not a set; it can contain duplicates. Partitioning is not opaque to the caller, and can be controlled and queried. Many operations, such as map, preserve both partitioning and order. That said, I find it a little too easy to accidentally violate the assumptions that zip depends on, since they're a little subtle, but it certainly has a purpose.
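To make the assumptions that zip depends on concrete, here is a toy model in plain Python. It is not Spark code: an "RDD" is modeled as a list of ordered partitions, and the helper names (map_rdd, zip_rdd, repartition) are hypothetical. The sketch assumes zip pairs elements position by position within matching partitions, which is how the Spark API documentation describes it (same number of partitions, same number of elements in each partition).

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")

# Toy stand-in for an RDD (an assumption for illustration only):
# a list of partitions, each an ordered list of elements.


def map_rdd(rdd: List[List[T]], f: Callable[[T], U]) -> List[List[U]]:
    """Element-wise map that keeps the partition layout and the order."""
    return [[f(x) for x in part] for part in rdd]


def zip_rdd(left: List[List[T]], right: List[List[U]]) -> List[List[Tuple[T, U]]]:
    """Pair elements by position, partition by partition."""
    if len(left) != len(right):
        raise ValueError("different number of partitions")
    out = []
    for lp, rp in zip(left, right):
        if len(lp) != len(rp):
            raise ValueError("partitions differ in size")
        out.append(list(zip(lp, rp)))
    return out


def repartition(rdd: List[List[T]], n: int) -> List[List[T]]:
    """Hypothetical reshuffle: spread the elements round-robin over n partitions."""
    flat = [x for part in rdd for x in part]
    return [flat[i::n] for i in range(n)]


def f(x: int) -> int:
    return x * 10


a = [[1, 2, 3], [4, 5]]

# map preserves layout and order, so zipping with the parent lines up.
zipped = zip_rdd(map_rdd(a, f), a)
fused = map_rdd(a, lambda x: (f(x), x))
assert zipped == fused

# A different partition layout violates the assumption and fails loudly.
try:
    zip_rdd(map_rdd(a, f), repartition(a, 3))
    layout_error = None
except ValueError as err:
    layout_error = str(err)

# Same layout but a different order within partitions: no error, wrong pairs.
reordered = [[3, 2, 1], [5, 4]]
silently_wrong = zip_rdd(map_rdd(a, f), reordered)
```

In this model, a.map(f).zip(a) matches a.map(x => (f(x),x)) because map changes neither the layout nor the order. The last case is the subtle one: when the layout matches but the order does not, nothing fails and the pairs are simply wrong.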
