Unexpected results in Spark MapReduce
Question
I'm new to Spark and want to understand how MapReduce gets done under the hood to ensure I use it properly. This post provided a great answer, but my results don't seem to follow the logic described. I'm running the Spark Quick Start guide in Scala on the command line. When I add up the line lengths properly, things come out just fine. The total line length is 1213:
scala> val textFile = sc.textFile("README.md")
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
scala> val linesWithSparkLengths = linesWithSpark.map(s => s.length)
scala> linesWithSparkLengths.foreach(println)
Result:
14
78
73
42
68
17
62
45
76
64
54
74
84
29
136
77
77
73
70
scala> val totalLWSparkLength = linesWithSparkLengths.reduce((a,b) => a+b)
totalLWSparkLength: Int = 1213
When I tweak it slightly to use (a-b) instead of (a+b),
scala> val totalLWSparkTest = linesWithSparkLengths.reduce((a,b) => a-b)
I expected -1185, following the logic from that post:
List(14,78,73,42,68,17,62,45,76,64,54,74,84,29,136,77,77,73,70).reduce( (x,y) => x - y )
Step 1 : op( 14, 78 ) will be the first evaluation.
x is 14 and y is 78. Result of x - y = -64.
Step 2: op( op( 14, 78 ), 73 )
x is op(14,78) = -64 and y = 73. Result of x - y = -137
Step 3: op( op( op( 14, 78 ), 73 ), 42)
x is op( op( 14, 78 ), 73 ) = -137 and y is 42. Result is -179.
...
Step 18: op( (... ), 73), 70) will be the final evaluation.
x is -1115 and y is 70. Result of x - y is -1185.
However, something strange happens:
scala> val totalLWSparkTest = linesWithSparkLengths.reduce((a,b) => a-b)
totalLWSparkTest: Int = 151
When I run it again...
scala> val totalLWSparkTest = linesWithSparkLengths.reduce((a,b) => a-b)
totalLWSparkTest: Int = -151
Can anyone tell me why the result is 151 (or -151) instead of -1185?
Recommended answer
It happens because subtraction is neither associative nor commutative. Let's start with associativity:
(- (- (- 14 78) 73) 42)
(- (- -64 73) 42)
(- -137 42)
-179
is not the same as
(- (- 14 78) (- 73 42))
(- -64 (- 73 42))
(- -64 31)
-95
Now it's time for commutativity:
(- (- (- 14 78) 73) 42) ;; From the previous example
is not the same as
(- (- (- 42 73) 78) 14)
(- (- -31 78) 14)
(- -109 14)
-123
Spark first applies reduce on the individual partitions and then merges the partial results in arbitrary order. If the function you use is not both associative and commutative, the final result can be non-deterministic.
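To make the partition-then-merge behaviour concrete, here is a minimal sketch in plain Scala, with no Spark involved. It only imitates the two stages described above; `reducePartitioned`, the two-way split and the `mergeOrder` parameter are hypothetical names and choices made for illustration, not Spark's API, and the split is not the one Spark actually used for README.md.

```scala
// Plain Scala collections standing in for an RDD (illustration only, not Spark).
object PartitionedReduceSketch {
  // Stage 1: reduce each partition from left to right.
  // Stage 2: merge the partial results in the order given by mergeOrder,
  // which stands in for the arbitrary order in which tasks finish.
  def reducePartitioned(partitions: Seq[Seq[Int]],
                        mergeOrder: Seq[Int],
                        op: (Int, Int) => Int): Int = {
    val partials = partitions.map(_.reduceLeft(op))
    mergeOrder.map(i => partials(i)).reduceLeft(op)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(14, 78, 73, 42)
    val partitions = Seq(Seq(14, 78), Seq(73, 42)) // hypothetical split

    val minus = (a: Int, b: Int) => a - b
    val plus = (a: Int, b: Int) => a + b

    println(data.reduceLeft(minus))                           // -179
    println(reducePartitioned(partitions, Seq(0, 1), minus))  // -95
    println(reducePartitioned(partitions, Seq(1, 0), minus))  // 95
    println(reducePartitioned(partitions, Seq(0, 1), plus))   // 207
    println(reducePartitioned(partitions, Seq(1, 0), plus))   // 207
  }
}
```

With subtraction, swapping the merge order flips the sign of the result, which has the same shape as the 151 / -151 pair in the question. With addition, both merge orders agree.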