Which operations preserve RDD order?


Question

An RDD has a meaningful order (as opposed to some arbitrary order imposed by the storage model) if it was processed by sortBy(), as explained in this reply.

Now, which operations preserve that order?

For example, after a.sortBy(), is it guaranteed that

a.map(f).zip(a) === 
a.map(x => (f(x),x))

How about

a.filter(f).map(g) === 
a.map(x => (x,g(x))).filter(f(_._1)).map(_._2)

What about

a.filter(f).flatMap(g) === 
a.flatMap(x => g(x).map((x,_))).filter(f(_._1)).map(_._2)

Here "equality" === is understood as "functional equivalence", i.e., there is no way to distinguish the outcome using user-level operations (i.e., without reading logs &c).

Recommended answer

All operations preserve the order, except those that explicitly do not. Ordering is always "meaningful", not just after a sortBy. For example, if you read a file (sc.textFile) the lines of the RDD will be in the order that they were in the file.

Without trying to give a complete list, map, filter, flatMap, and coalesce (with shuffle=false) do preserve the order. sortBy, partitionBy, join do not preserve the order.
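
As a rough illustration of the list above, the following sketch checks that map, filter, and flatMap keep the element order of a sorted RDD. The input data, the object name OrderPreservationSketch, and the local[2] master are assumptions chosen for the example, not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OrderPreservationSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark installation is on the classpath.
    val conf = new SparkConf().setMaster("local[2]").setAppName("order-sketch")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical input: unsorted integers spread over 4 partitions, then sorted.
      val a = sc.parallelize(Seq(5, 3, 8, 1, 9, 2, 7), numSlices = 4).sortBy(x => x)

      // collect() returns the elements partition by partition, in partition order.
      val sorted = a.collect().toList

      // Each narrow transformation should yield the same result as applying
      // the equivalent operation to the sorted local list.
      assert(a.map(_ * 10).collect().toList == sorted.map(_ * 10))
      assert(a.filter(_ % 2 == 1).collect().toList == sorted.filter(_ % 2 == 1))
      assert(a.flatMap(x => Seq(x, -x)).collect().toList == sorted.flatMap(x => Seq(x, -x)))
    } finally {
      sc.stop()
    }
  }
}
```

A passing run on one small input is only a sanity check, not a proof; the guarantee comes from how these operations are implemented.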

The reason is that most RDD operations work on Iterators inside the partitions, so map or filter simply has no way to reorder the elements. You can take a look at the code (https://github.com/apache/spark/tree/v1.3.0/core/src/main/scala/org/apache/spark/rdd) to see for yourself.
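
To make the Iterator argument concrete, here is a toy model in plain Scala. It is not Spark's implementation: ToyRdd and the helper functions are hypothetical names, and a "partition" is just a Vector. It only shows that when every operation is an Iterator-to-Iterator transformation applied partition by partition, the relative order of elements cannot change.

```scala
object PartitionIteratorSketch {
  // Toy stand-in for an RDD: an ordered sequence of partitions,
  // each holding an ordered sequence of elements.
  type ToyRdd[T] = Vector[Vector[T]]

  // Apply an iterator transformation to each partition independently.
  def mapPartitions[T, U](rdd: ToyRdd[T])(f: Iterator[T] => Iterator[U]): ToyRdd[U] =
    rdd.map(part => f(part.iterator).toVector)

  // map and filter expressed purely as per-partition iterator operations.
  def map[T, U](rdd: ToyRdd[T])(g: T => U): ToyRdd[U] =
    mapPartitions(rdd)(_.map(g))

  def filter[T](rdd: ToyRdd[T])(p: T => Boolean): ToyRdd[T] =
    mapPartitions(rdd)(_.filter(p))

  // Concatenate the partitions in partition order.
  def collect[T](rdd: ToyRdd[T]): Vector[T] = rdd.flatten

  def main(args: Array[String]): Unit = {
    val a: ToyRdd[Int] = Vector(Vector(1, 2, 3), Vector(4, 5), Vector(6, 7, 8))
    val result = collect(filter(map(a)(_ * 10))(_ % 20 == 0))
    // The surviving elements appear in their original relative order.
    assert(result == Vector(20, 40, 60, 80))
    println(result)
  }
}
```

Operations that move elements between partitions based on a key (a shuffle) do not fit this per-partition model, which is consistent with sortBy, partitionBy, and join being listed as not order-preserving.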

You may now ask: what if I have an RDD with a HashPartitioner, and I use map to change the keys? The elements stay in place, and the RDD is no longer partitioned by key. You can use partitionBy to restore the partitioning with a shuffle.
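
The partitioner behaviour described here can be observed through the RDD's partitioner field. The sketch below assumes a local Spark setup; the sample pairs and the object name PartitionerSketch are made up for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionerSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark installation is on the classpath.
    val conf = new SparkConf().setMaster("local[2]").setAppName("partitioner-sketch")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical key-value data, hash-partitioned by key.
      val byKey = sc
        .parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
        .partitionBy(new HashPartitioner(2))
      assert(byKey.partitioner.isDefined)

      // map may change the keys, so the result no longer carries a partitioner.
      // No data is moved: each element stays in the partition it was in.
      val rekeyed = byKey.map { case (k, v) => (k + 1, v) }
      assert(rekeyed.partitioner.isEmpty)

      // partitionBy shuffles the data and restores partitioning by the new keys.
      val restored = rekeyed.partitionBy(new HashPartitioner(2))
      assert(restored.partitioner == Some(new HashPartitioner(2)))
    } finally {
      sc.stop()
    }
  }
}
```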
