Mind blown: RDD.zip() method


Question

I just discovered the RDD.zip() method and I cannot imagine what its contract could possibly be.

I understand what it does, of course. However, it has always been my understanding that


  • the order of elements in an RDD is a meaningless concept
  • the number of partitions and their sizes are an implementation detail, exposed to the user only for performance tuning

In other words, an RDD is a (multi)set, not a sequence (and, of course, in Python, for example, one gets AttributeError: 'set' object has no attribute 'zip').

What is wrong with my understanding above?

What was the rationale behind this method?

Is it legal outside a trivial context like a.map(f).zip(a)?

Edit 1:


  • Another crazy method is zipWithIndex(), as well as the various zipPartitions() variants.
  • Note that first() and take() are not crazy because they are just (non-random) samples of the RDD.
  • collect() is also okay - it just converts a set to a sequence, which is perfectly legit.

Edit 2: A reply says:

when you compute one RDD from another, the order of elements in the new RDD may not correspond to that in the old one.

This appears to imply that even the trivial a.map(f).zip(a) is not guaranteed to be equivalent to a.map(x => (f(x), x)). In what situations are zip() results reproducible?

Recommended answer

It is not true that RDDs are always unordered. An RDD has a guaranteed order if it is the result of a sortBy operation, for example. An RDD is not a set; it can contain duplicates. Partitioning is not opaque to the caller, and can be controlled and queried. Many operations, such as map, preserve both partitioning and order. That said, I find it a little too easy to accidentally violate the assumptions that zip depends on, since they are somewhat subtle, but the method certainly has a purpose.
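To make those assumptions concrete, here is a minimal pure-Python sketch. It is not Spark code: an RDD is modelled as a plain list of partitions, and the helper names pmap, pzip and rebalance are hypothetical, invented for this illustration. It relies on the contract stated in the Spark API documentation for zip, namely that both RDDs have the same number of partitions and the same number of elements in each partition.

```python
from typing import Any, Callable, List

# Toy stand-in for an RDD (an assumption of this sketch, not Spark's
# implementation): an ordered list of partitions, each an ordered list.
Partitioned = List[List[Any]]


def pmap(f: Callable[[Any], Any], data: Partitioned) -> Partitioned:
    """Element-wise map: keeps the partition layout and the order within it."""
    return [[f(x) for x in part] for part in data]


def pzip(left: Partitioned, right: Partitioned) -> Partitioned:
    """Pair elements position by position, partition by partition."""
    if len(left) != len(right):
        raise ValueError("different number of partitions")
    out = []
    for lpart, rpart in zip(left, right):
        if len(lpart) != len(rpart):
            raise ValueError("partitions differ in size")
        out.append(list(zip(lpart, rpart)))
    return out


def rebalance(data: Partitioned, n: int) -> Partitioned:
    """Spread elements round-robin over n partitions, changing the layout."""
    parts: Partitioned = [[] for _ in range(n)]
    flat = [x for part in data for x in part]
    for i, x in enumerate(flat):
        parts[i % n].append(x)
    return parts


def square(x: int) -> int:
    return x * x


a = [[1, 2, 3], [4, 5], [6]]

# map keeps layout and order, so zipping the result with the source lines up.
zipped = pzip(pmap(square, a), a)
paired = pmap(lambda x: (square(x), x), a)
same_after_map = zipped == paired

# A changed layout breaks the contract, and this model rejects it.
try:
    pzip(pmap(square, a), rebalance(a, 2))
    contract_violation = None
except ValueError as err:
    contract_violation = str(err)

# Same layout but a different order inside each partition: nothing fails,
# the pairs are simply wrong. This is the subtle case.
reordered = [list(reversed(part)) for part in a]
silently_wrong = pzip(pmap(square, a), reordered) != paired
```

In this model a.map(f).zip(a) matches a.map(x => (f(x), x)) only because pmap touches neither the layout nor the order. Whether a given real Spark transformation preserves both has to be checked against that transformation's documentation.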
