Which operations preserve RDD order?

Published: 2021/11/12 5:30:30 · Tags: apache-spark, rdd


Question

An RDD has a meaningful order (as opposed to some arbitrary order imposed by the storage model) if it was processed by sortBy(), as explained in this reply.

Now, which operations preserve that order?

For example, is it guaranteed (after a.sortBy()) that

a.map(f).zip(a) === 
a.map(x => (f(x),x))

What about

a.filter(f).map(g) ===
a.map(x => (x, g(x))).filter(p => f(p._1)).map(_._2)

What about

a.filter(f).flatMap(g) ===
a.flatMap(x => g(x).map(y => (x, y))).filter(p => f(p._1)).map(_._2)

Here "equality" (===) is understood as "functional equivalence", i.e., there is no way to distinguish the two outcomes using user-level operations (that is, without reading logs and the like).

Recommended answer

All operations preserve the order, except those that explicitly do not. Ordering is always "meaningful", not just after a sortBy. For example, if you read a file with sc.textFile, the lines of the RDD will be in the same order as they were in the file.

Without trying to give a complete list: map, filter and flatMap do preserve the order. sortBy, partitionBy and join do not preserve the order.
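As a rough check of the three equivalences from the question, the sketch below runs them against a local Spark context. It assumes Spark is on the classpath; the object name OrderCheck, the sample data, and the functions f, p, g and h are made up for illustration (the question uses f for both the mapping function and the predicate, so they are split here). Passing assertions on one small dataset illustrate the claim but do not prove it in general.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OrderCheck {
  def main(args: Array[String]): Unit = {
    // Local master and app name are arbitrary choices for this sketch.
    val conf = new SparkConf().setAppName("order-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data, spread over 3 partitions and then sorted.
      val a = sc.parallelize(Seq(5, 3, 8, 1, 9, 2, 7), numSlices = 3).sortBy(x => x)

      val f = (x: Int) => x * 10        // mapping function
      val p = (x: Int) => x % 2 == 1    // predicate (the question's filter argument)
      val g = (x: Int) => x + 100       // mapping function applied after the filter
      val h = (x: Int) => Seq(x, -x)    // flatMap function

      // a.map(f).zip(a) === a.map(x => (f(x), x))
      val zipped = a.map(f).zip(a).collect().toSeq
      val paired = a.map(x => (f(x), x)).collect().toSeq
      assert(zipped == paired)

      // a.filter(p).map(g) === a.map(x => (x, g(x))).filter(...).map(_._2)
      val lhsMap = a.filter(p).map(g).collect().toSeq
      val rhsMap = a.map(x => (x, g(x))).filter(t => p(t._1)).map(_._2).collect().toSeq
      assert(lhsMap == rhsMap)

      // a.filter(p).flatMap(h) === a.flatMap(...).filter(...).map(_._2)
      val lhsFlat = a.filter(p).flatMap(h).collect().toSeq
      val rhsFlat = a.flatMap(x => h(x).map(y => (x, y))).filter(t => p(t._1)).map(_._2).collect().toSeq
      assert(lhsFlat == rhsFlat)

      println("all three equivalences held on the sample data")
    } finally {
      sc.stop()
    }
  }
}
```

The zip in the first check works because map keeps both the number of partitions and the number of elements per partition unchanged, which is what zip requires.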

The reason is that most RDD operations work on Iterators inside the partitions. So map or filter simply has no way to mess up the order. You can take a look at the code to see for yourself.
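To make the iterator argument concrete, here is a toy model in plain Scala with no Spark dependency. It is only a sketch: the Partitions type alias and the PartitionModel object are invented for illustration and are not how Spark represents an RDD internally. The point is that when every operation is expressed as a function from one partition's Iterator to another Iterator, elements can be dropped or expanded but never reordered.

```scala
object PartitionModel {
  // Toy stand-in for an RDD: an ordered sequence of partitions,
  // each holding its elements in order. Not Spark's actual data structure.
  type Partitions[A] = Vector[Vector[A]]

  // Every operation below goes through this: each partition's iterator
  // is transformed independently, and partitions keep their positions.
  def mapPartitions[A, B](ps: Partitions[A])(f: Iterator[A] => Iterator[B]): Partitions[B] =
    ps.map(part => f(part.iterator).toVector)

  def map[A, B](ps: Partitions[A])(f: A => B): Partitions[B] =
    mapPartitions[A, B](ps)(_.map(f))

  def filter[A](ps: Partitions[A])(p: A => Boolean): Partitions[A] =
    mapPartitions[A, A](ps)(_.filter(p))

  def flatMap[A, B](ps: Partitions[A])(f: A => Iterable[B]): Partitions[B] =
    mapPartitions[A, B](ps)(_.flatMap(f))

  // Collecting concatenates the partitions in index order.
  def collect[A](ps: Partitions[A]): Vector[A] = ps.flatten

  def main(args: Array[String]): Unit = {
    val data: Partitions[Int] = Vector(Vector(1, 2, 3), Vector(4, 5), Vector(6, 7, 8))
    val kept = filter(map(data)(_ * 10))(_ % 20 != 0)
    // The surviving elements appear in the same relative order as in `data`.
    println(collect(flatMap(kept)(x => Seq(x, x + 1))))
  }
}
```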

You may now ask: what if I have an RDD with a HashPartitioner, and I use map to change the keys? Well, the elements stay in place, and the RDD is no longer partitioned by the key. You can use partitionBy to restore the partitioning, at the cost of a shuffle.
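The behaviour described above can be observed through the partitioner field of an RDD. The following is a minimal sketch assuming Spark is on the classpath; the object name PartitionerCheck, the sample pairs, and the partition count of 2 are arbitrary choices for illustration. The mapValues comparison is an addition not discussed in the answer: it is included because it leaves keys untouched and so keeps the partitioner.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionerCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partitioner-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical key/value data, hash-partitioned by key.
      val byKey = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
        .partitionBy(new HashPartitioner(2))
      assert(byKey.partitioner.isDefined)

      // map may change the keys, so the result carries no partitioner.
      val rekeyed = byKey.map { case (k, v) => (k + 1, v) }
      assert(rekeyed.partitioner.isEmpty)

      // The elements did not move: each partition still holds the same values.
      val before = byKey.glom().collect().map(_.map(_._2).toSeq).toSeq
      val after = rekeyed.glom().collect().map(_.map(_._2).toSeq).toSeq
      assert(before == after)

      // mapValues cannot touch the keys, so the partitioner is retained.
      assert(byKey.mapValues(_.toUpperCase).partitioner == byKey.partitioner)

      // partitionBy restores key-based partitioning; this step shuffles.
      val restored = rekeyed.partitionBy(new HashPartitioner(2))
      assert(restored.partitioner.isDefined)

      println("partitioner checks passed")
    } finally {
      sc.stop()
    }
  }
}
```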
