Which operations preserve RDD order?
Question
An RDD has a meaningful order (as opposed to some random order imposed by the storage model) if it was processed by sortBy(), as explained in this reply.
Now, which operations preserve that order?
For example, is it guaranteed (after a.sortBy()) that

a.map(f).zip(a) ===
a.map(x => (f(x),x))

What about

a.filter(f).map(g) ===
a.map(x => (x,g(x))).filter(p => f(p._1)).map(_._2)

What about

a.filter(f).flatMap(g) ===
a.flatMap(x => g(x).map((x,_))).filter(p => f(p._1)).map(_._2)

Here "equality" (===) is understood as "functional equivalence", i.e., there is no way to distinguish the outcomes using user-level operations (that is, without reading logs etc.).
Recommended answer
All operations preserve the order, except those that explicitly do not. Ordering is always "meaningful", not just after a sortBy. For example, if you read a file (sc.textFile), the lines of the RDD will be in the order that they were in the file.
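As a rough way to check the sc.textFile claim locally, the sketch below writes a temporary file and compares the collected lines with what was written. The object name, method name, file contents, and the choice of 4 minimum partitions are illustrative assumptions, not part of the original answer.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object TextFileOrderSketch {
  // Hypothetical helper: true if collect() returns the lines in file order.
  def readsInFileOrder(sc: SparkContext): Boolean = {
    val lines = (1 to 1000).map(i => s"line $i")
    val path = Files.createTempFile("rdd-order", ".txt")
    Files.write(path, lines.mkString("\n").getBytes("UTF-8"))
    // Ask for several partitions so the check spans partition boundaries.
    sc.textFile(path.toUri.toString, 4).collect().toSeq == lines
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("textfile-order-sketch"))
    try println(readsInFileOrder(sc)) finally sc.stop()
  }
}
```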
Without trying to give a complete list, map, filter and flatMap do preserve the order. sortBy, partitionBy, join do not preserve the order.
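The three equivalences from the question can be tried out on a small local RDD. This is only a sketch: the functions f, g and h and the sample data are made up for illustration, and a passing check on one dataset is evidence, not proof.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OrderPreservationSketch {
  // Hypothetical example functions, chosen only for illustration.
  val f: Int => Boolean = _ % 2 == 0
  val g: Int => Int = _ * 10
  val h: Int => Seq[Int] = x => Seq(x, x + 1)

  // Returns true if all three equivalences from the question hold on the sample data.
  def check(sc: SparkContext): Boolean = {
    // sortBy gives the RDD a known order spread over 4 partitions.
    val a = sc.parallelize(Seq(5, 3, 8, 1, 9, 2, 7, 4, 6), 4).sortBy(x => x)

    val zipOk =
      a.map(x => g(x)).zip(a).collect().toSeq ==
        a.map(x => (g(x), x)).collect().toSeq

    val filterMapOk =
      a.filter(x => f(x)).map(x => g(x)).collect().toSeq ==
        a.map(x => (x, g(x))).filter(p => f(p._1)).map(_._2).collect().toSeq

    val filterFlatMapOk =
      a.filter(x => f(x)).flatMap(x => h(x)).collect().toSeq ==
        a.flatMap(x => h(x).map(y => (x, y))).filter(p => f(p._1)).map(_._2).collect().toSeq

    zipOk && filterMapOk && filterFlatMapOk
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("order-sketch"))
    try println(check(sc)) finally sc.stop()
  }
}
```

Note that zip requires both RDDs to have the same number of partitions and the same number of elements in each partition, which holds here because one side is a plain map of the other.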
The reason is that most RDD operations work on Iterators inside the partitions. So map or filter just has no way to mess up the order. You can take a look at the code to see for yourself.
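The Iterator argument can be illustrated without Spark at all. The following is a toy model, not Spark's actual implementation: an "RDD" is assumed to be an ordered list of partitions, and every operation is applied to each partition's Iterator independently. All names here are hypothetical.

```scala
object IteratorOrderModel {
  // Toy stand-in for an RDD: an ordered list of partitions,
  // each holding its elements in order.
  type Partitions[A] = Vector[Vector[A]]

  // Apply an Iterator => Iterator function to every partition on its own.
  // Partitions are never reordered and elements never cross partitions.
  def perPartition[A, B](ps: Partitions[A])(op: Iterator[A] => Iterator[B]): Partitions[B] =
    ps.map(p => op(p.iterator).toVector)

  def main(args: Array[String]): Unit = {
    val data: Partitions[Int] = Vector(Vector(1, 2, 3), Vector(4, 5), Vector(6, 7, 8, 9))

    val mapped = perPartition(data)(_.map(_ * 10))
    val filtered = perPartition(data)(_.filter(_ % 2 == 0))

    // Flattening in partition order gives the same result as mapping or
    // filtering the whole sequence.
    println(mapped.flatten)   // Vector(10, 20, 30, 40, 50, 60, 70, 80, 90)
    println(filtered.flatten) // Vector(2, 4, 6, 8)
  }
}
```

Because Iterator.map and Iterator.filter visit elements strictly in sequence, the only way to lose the order in this model would be to move elements between partitions, which is what a shuffle does.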
You may now ask: what if I have an RDD with a HashPartitioner, and I use map to change the keys? Well, the elements stay in place, and the RDD is no longer partitioned by the key. You can use partitionBy to restore the partitioning with a shuffle.
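A minimal sketch of this behaviour, inspecting the partitioner field of each RDD. The sample data and names are illustrative assumptions. The mapValues line is an addition not mentioned in the answer: it cannot change keys, so the partitioner survives it.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionerSketch {
  // Returns whether a partitioner is defined for:
  // (original, after map, after mapValues, after partitionBy again).
  def partitionerStates(sc: SparkContext): (Boolean, Boolean, Boolean, Boolean) = {
    val byKey = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
      .partitionBy(new HashPartitioner(2))

    // map may change the keys, so the result carries no partitioner;
    // the elements themselves are not moved.
    val rekeyed = byKey.map { case (k, v) => (k + 1, v) }

    // mapValues leaves the keys untouched, so the partitioner is kept.
    val sameKeys = byKey.mapValues(_.toUpperCase)

    // partitionBy shuffles the data and restores partitioning by key.
    val restored = rekeyed.partitionBy(new HashPartitioner(2))

    (byKey.partitioner.isDefined, rekeyed.partitioner.isDefined,
      sameKeys.partitioner.isDefined, restored.partitioner.isDefined)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("partitioner-sketch"))
    try println(partitionerStates(sc)) finally sc.stop() // (true,false,true,true)
  }
}
```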