Does groupByKey in Spark preserve the original order?


Question

In Spark, the groupByKey function (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) transforms a (K,V) pair RDD into a (K,Iterable<V>) pair RDD.

But is this function stable? That is, does the order of the values in each iterable follow their original order in the RDD?

For example, if I originally read a file of the form:

K1;V11
K2;V21
K1;V12

Can my iterable for K1 come out as (V12, V11), not preserving the original order, or can it only be (V11, V12), preserving the original order?

Recommended answer

No, the order is not preserved. Example in spark-shell:

scala> sc.parallelize(Seq(0->1, 0->2), 2).groupByKey.collect
res0: Array[(Int, Iterable[Int])] = Array((0,ArrayBuffer(2, 1)))

The order is timing-dependent, so it can vary between runs. (I got the opposite order on my next run.)

What is happening here? groupByKey works by repartitioning the RDD with a HashPartitioner, so that all values for a key end up in the same partition. It then performs the aggregation locally on each partition.
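To make the partitioning step concrete, here is a minimal plain-Scala sketch of hash partitioning: the key's hash code, taken modulo the number of partitions, picks the target partition. It does not use Spark. The names HashPartitionSketch, nonNegativeMod and partitionFor are made up for illustration, and the logic only approximates what a hash partitioner does.

```scala
// Illustrative sketch only: HashPartitionSketch and partitionFor are
// hypothetical names, not Spark APIs.
object HashPartitionSketch {
  // Non-negative modulo, so negative hash codes still give a valid index.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // The same key always maps to the same partition index.
  def partitionFor(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    // The records from the question's example file.
    val records = Seq("K1" -> "V11", "K2" -> "V21", "K1" -> "V12")
    records.foreach { case (k, v) =>
      println(s"$k;$v -> partition ${partitionFor(k, 2)}")
    }
  }
}
```

Both K1 records land in the same partition, which is what lets the grouping run locally. Nothing in this step says in which order they get there.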

The repartitioning is also called a "shuffle", because the lines of the RDD are redistributed between nodes. The shuffle files are pulled from the other nodes in parallel, and the new partition is built from these pieces in the order in which they arrive. The data from the slowest source will be at the end of the new partition, and at the end of the list produced by groupByKey.
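The effect of arrival order can be imitated without a cluster. The following plain-Scala sketch is a simulation, not Spark code. It treats each source node's shuffle output as a block and builds the groups by appending values in whatever order the blocks arrive. ShuffleOrderSketch, groupInArrivalOrder, fromNodeA and fromNodeB are hypothetical names used only for this illustration.

```scala
// Simulation only: models "build the partition from blocks in arrival order".
object ShuffleOrderSketch {
  // Group values by key, appending each value in the order it is encountered.
  def groupInArrivalOrder[K, V](blocks: Seq[Seq[(K, V)]]): Map[K, Seq[V]] =
    blocks.flatten.foldLeft(Map.empty[K, Vector[V]]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, Vector.empty[V]) :+ v)
    }

  def main(args: Array[String]): Unit = {
    // Two hypothetical map-side blocks, one per source partition.
    val fromNodeA = Seq(0 -> 1)
    val fromNodeB = Seq(0 -> 2)

    // Same data, different arrival order, different order inside the group.
    println(groupInArrivalOrder(Seq(fromNodeA, fromNodeB)))
    println(groupInArrivalOrder(Seq(fromNodeB, fromNodeA)))
  }
}
```

The two calls receive identical records and differ only in which block comes first, which is enough to flip the order of the values for key 0.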

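(The data pulled from the worker itself is of course the fastest. Since no network transfer is involved, this data is pulled synchronously and therefore arrives in order, or at least it seems to. So to replicate my experiment you need at least 2 Spark workers.)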
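If a deterministic order is needed, one possible workaround (not part of the original answer) is to attach each record's position before grouping and sort by it afterwards. The sketch below is meant for spark-shell and assumes the predefined SparkContext `sc`. groupPreservingOrder is a hypothetical helper name, while zipWithIndex, groupByKey and mapValues are existing RDD operations.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical helper (not a Spark API): group by key, then restore the
// original record order inside each group using an explicit index.
def groupPreservingOrder[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): RDD[(K, Seq[V])] = {
  rdd.zipWithIndex()                              // ((k, v), position in the RDD)
    .map { case ((k, v), i) => (k, (i, v)) }      // carry the position with the value
    .groupByKey()                                 // order inside a group is not guaranteed
    .mapValues(_.toSeq.sortBy(_._1).map(_._2))    // sort by position, then drop it
}

// Usage in spark-shell, where `sc` is assumed to be the predefined SparkContext:
val pairs = sc.parallelize(Seq("K1" -> "V11", "K2" -> "V21", "K1" -> "V12"), 2)
groupPreservingOrder(pairs).collect().foreach(println)
```

The sort happens in memory per key, so this is a sketch for groups of moderate size rather than a general-purpose solution.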

Source: http://apache-spark-user-list.1001560.n3.nabble.com/Is-shuffle-quot-stable-quot-td7628.html
