Ordering of rows in JavaRDDs after union
Problem description
I am trying to find any information on the ordering of rows in an RDD. Here is what I am trying to do:
Rdd1, Rdd2
Rdd3 = Rdd1.union(Rdd2);
In Rdd3, is there any guarantee that the Rdd1 records will appear first and the Rdd2 records afterwards? In my tests I saw this behavior, but I wasn't able to find it in any docs.
Just FYI, I do not care about the ordering within each RDD (i.e. the order of the data inside Rdd1 or Rdd2 is not a concern); the only requirement is that, after the union, Rdd1's records come first.
Recommended answer
In Spark, the elements within a particular partition are unordered; however, the partitions themselves are ordered: http://spark.apache.org/docs/latest/programming-guide.html#background
If you check your RDD3, you should find that it is just all the partitions of RDD1 followed by all the partitions of RDD2, so in this case the results happen to be ordered the way you want. Simply concatenating the partitions of the two RDDs is the standard behaviour of Spark; see the question "In Apache Spark, why does RDD.union not preserve the partitioner?"
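To make the concatenation idea concrete, here is a minimal plain-Java sketch. It is a toy model, not Spark's API: an "RDD" is represented as an ordered list of partitions, and the class and method names (`UnionOrderModel`, `union`, `collect`) are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: an "RDD" is an ordered list of partitions, and each
// partition is a list of records. This does not use or mirror Spark's classes.
class UnionOrderModel {

    // Models union as plain concatenation of the two partition lists:
    // all partitions of `first`, followed by all partitions of `second`.
    static <T> List<List<T>> union(List<List<T>> first, List<List<T>> second) {
        List<List<T>> result = new ArrayList<>(first);
        result.addAll(second);
        return result;
    }

    // Reads the records back partition by partition, in partition order.
    static <T> List<T> collect(List<List<T>> partitions) {
        List<T> out = new ArrayList<>();
        for (List<T> partition : partitions) {
            out.addAll(partition);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> rdd1 = List.of(List.of("a1", "a2"), List.of("a3"));
        List<List<String>> rdd2 = List.of(List.of("b1"), List.of("b2", "b3"));
        List<List<String>> rdd3 = union(rdd1, rdd2);
        System.out.println(rdd3.size());   // partition count of the modelled union
        System.out.println(collect(rdd3)); // records read back in partition order
    }
}
```

In this model the union has as many partitions as both inputs combined, and every record of the first input is read back before any record of the second, which matches the behaviour observed in the question.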
So in this case, it appears that union will give you what you want. However, this behaviour is an implementation detail of union and not part of its interface definition, so you cannot rely on it not being reimplemented with different behaviour in the future.
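If the "first dataset's records come first" requirement must hold regardless of how union is implemented, one option is to make the order explicit: tag every record with the index of its source and sort by that tag afterwards. The sketch below shows the idea in plain Java under stated assumptions. The class `SourceTagOrdering` and its methods are hypothetical, and the shuffle only simulates a union whose output order is arbitrary. In Spark, the analogous steps would be keying each record (for example with `mapToPair`) and sorting on the key (for example with `sortByKey`), at the cost of a shuffle.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustration only: makes the "source 0 before source 1" order explicit
// instead of relying on how a union happens to lay out its output.
class SourceTagOrdering {

    // Pairs every record with the index of the dataset it came from.
    static <T> List<Map.Entry<Integer, T>> tag(int source, List<T> records) {
        List<Map.Entry<Integer, T>> tagged = new ArrayList<>();
        for (T record : records) {
            tagged.add(Map.entry(source, record));
        }
        return tagged;
    }

    static <T> List<T> unionInSourceOrder(List<T> first, List<T> second, long seed) {
        List<Map.Entry<Integer, T>> combined = new ArrayList<>(tag(0, first));
        combined.addAll(tag(1, second));

        // Assumption for the demo: pretend the union returned records in arbitrary order.
        Collections.shuffle(combined, new Random(seed));

        // Sorting by the source tag restores "first dataset, then second dataset".
        // The order of records inside each source is not restored.
        combined.sort(Map.Entry.comparingByKey());

        List<T> out = new ArrayList<>();
        for (Map.Entry<Integer, T> entry : combined) {
            out.add(entry.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(unionInSourceOrder(List.of("a1", "a2"), List.of("b1", "b2", "b3"), 42L));
    }
}
```

This only guarantees that all records of the first source precede those of the second; the order within each source is left unspecified, which is all the question asks for.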