Ordering of rows in JavaRDDs after union

Question

I am trying to find any information on the ordering of the rows in an RDD. Here is what I am trying to do:

Rdd1, Rdd2
Rdd3 = Rdd1.union(Rdd2);

In Rdd3, is there any guarantee that the Rdd1 records will appear first and the Rdd2 records afterwards? In my tests I saw this behavior, but I wasn't able to find it in any docs.

Just FYI, I do not care about the ordering within each RDD (i.e. the order of the data inside Rdd1 or Rdd2 is not a concern); the only requirement is that, after the union, the Rdd1 records come first.

Recommended answer

In Spark, the elements within a particular partition are unordered; however, the partitions themselves are ordered (see http://spark.apache.org/docs/latest/programming-guide.html#background).

If you check your RDD3, you should find that RDD3 is just all the partitions of RDD1 followed by all the partitions of RDD2, so in this case the results happen to be ordered in the way you want. The related question "In Apache Spark, why does RDD.union not preserve the partitioner?" explains that simply concatenating the partitions of the two RDDs is the standard behaviour of Spark.
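To make the partition concatenation concrete, here is a minimal sketch in plain Java. It does not use Spark: PartitionedData is a hypothetical stand-in for an RDD, modelled as an ordered list of partitions, and its union and collect methods only imitate the behaviour described above under that assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an RDD: an ordered list of partitions,
// each partition holding its records. This is NOT Spark's API.
final class PartitionedData {
    final List<List<String>> partitions;

    PartitionedData(List<List<String>> partitions) {
        this.partitions = partitions;
    }

    // Models the union behaviour described above: all partitions of `this`
    // come first, followed by all partitions of `other`. Nothing is shuffled.
    PartitionedData union(PartitionedData other) {
        List<List<String>> combined = new ArrayList<>(partitions);
        combined.addAll(other.partitions);
        return new PartitionedData(combined);
    }

    // Reads the partitions in partition order and flattens them into one list.
    List<String> collect() {
        List<String> out = new ArrayList<>();
        for (List<String> partition : partitions) {
            out.addAll(partition);
        }
        return out;
    }

    public static void main(String[] args) {
        PartitionedData rdd1 = new PartitionedData(
                Arrays.asList(Arrays.asList("a1", "a2"), Arrays.asList("a3")));
        PartitionedData rdd2 = new PartitionedData(
                Arrays.asList(Arrays.asList("b1"), Arrays.asList("b2", "b3")));
        PartitionedData rdd3 = rdd1.union(rdd2);

        System.out.println(rdd3.partitions.size()); // prints 4
        System.out.println(rdd3.collect());         // prints [a1, a2, a3, b1, b2, b3]
    }
}
```

In this model the union has as many partitions as both inputs combined, and reading them in order yields the first dataset's records before the second's.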

So in this case, it appears that union will give you what you want. However, this behaviour is an implementation detail of union and is not part of its interface definition, so you cannot rely on it not being reimplemented with different behaviour in the future.
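If the order matters and you would rather not depend on an implementation detail, one option is to make the order explicit in the data. The sketch below is plain Java, not Spark code, and every name in it (ExplicitOrderSketch, Tagged, sourceIndex) is hypothetical. It only illustrates the idea: tag each record with the index of the dataset it came from, then sort by that tag, so the result no longer depends on how the combined collection happened to be arranged.

```java
import java.util.ArrayList;
import java.util.List;

final class ExplicitOrderSketch {
    // Hypothetical wrapper: sourceIndex is 0 for the first dataset, 1 for the second.
    static final class Tagged {
        final int sourceIndex;
        final String value;

        Tagged(int sourceIndex, String value) {
            this.sourceIndex = sourceIndex;
            this.value = value;
        }
    }

    // Attaches the source index to every record of one dataset.
    static List<Tagged> tag(List<String> records, int sourceIndex) {
        List<Tagged> out = new ArrayList<>();
        for (String record : records) {
            out.add(new Tagged(sourceIndex, record));
        }
        return out;
    }

    // Sorts by source index only. List.sort is stable, so records that share
    // a source index keep their relative order from the input list.
    static List<String> orderedValues(List<Tagged> combined) {
        List<Tagged> sorted = new ArrayList<>(combined);
        sorted.sort((Tagged x, Tagged y) -> Integer.compare(x.sourceIndex, y.sourceIndex));
        List<String> out = new ArrayList<>();
        for (Tagged t : sorted) {
            out.add(t.value);
        }
        return out;
    }
}
```

With this approach, even if the combined records arrive interleaved, sorting on the tag restores "first dataset, then second dataset". The trade-off is the cost of the extra sort.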
