How do `map` and `reduce` methods work in Spark RDDs?
The following code is from the quick start guide of Apache Spark. Can somebody explain what the `line` variable is and where it comes from?
    textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
Also, how does a value get passed into `a` and `b`?
Link to the quick start guide: http://spark.apache.org/docs/latest/quick-start.html
Solution
First, according to your link, `textFile` is created as
    val textFile = sc.textFile("README.md")
so `textFile` is an `RDD[String]`, meaning it is a resilient distributed dataset with elements of type `String`. The API for accessing it is very similar to that of regular Scala collections.
So what does `map` do here?
Imagine you have a list of `String`s and want to convert it into a list of `Int`s representing the length of each `String`.
    val stringlist: List[String] = List("ab", "cde", "f")
    val intlist: List[Int] = stringlist.map(x => x.length)
The `map` method expects a function, here one of type `String => Int`. That function is applied to each element of the list, so the value of `intlist` is `List(2, 3, 1)`.
Here we have created an anonymous function of type `String => Int`, namely `x => x.length`. You can also write the function more explicitly as
    stringlist.map((x: String) => x.length)
If you want to be fully explicit, you can name the function and give it a type:
    val stringLength: (String => Int) = { x => x.length }
    val intlist = stringlist.map(stringLength)
Here it is evident that `stringLength` is a function from `String` to `Int`.
Remark: In general, `map` is what makes up a so-called functor. You provide a function `A => B`, and the `map` of the functor (here `List`) lets you use that function to go from `List[A]` to `List[B]`. This is called lifting.
Answers to your questions
What is the `line` variable?
As mentioned above, `line` is the input parameter of the function `line => line.split(" ").size`. `map` calls this function once for every element of the RDD, so `line` is bound to each line of the file in turn. Written more explicitly:
    (line: String) => line.split(" ").size
Example: if `line` is "hello world", the function returns 2.
    "hello world"
    => Array("hello", "world")  // split
    => 2                        // size of Array
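The split-and-count step can be tried on its own without Spark. The sketch below is only an illustration: the object name `LineWordCount` and the value name `wordCount` are hypothetical and do not appear in the quick start guide.

```scala
object LineWordCount {
  // The same function as in the question, with the parameter type written out.
  val wordCount: String => Int = (line: String) => line.split(" ").size

  def main(args: Array[String]): Unit = {
    val line = "hello world"
    // Splitting on a single space yields the individual words.
    val words: Array[String] = line.split(" ")
    println(words.mkString(", "))
    // The size of that array is the value map would produce for this line.
    println(wordCount(line))
  }
}
```

Note that splitting on a single space treats punctuation as part of a word, so "cool example, eh?" counts as three words.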
How does a value get passed into `a` and `b`?
`reduce` also expects a function, this time of type `(A, A) => A`, where `A` is the element type of your `RDD`. Let's call this function `op`.
What does `reduce` do? Example:
    List(1, 2, 3, 4).reduce((x, y) => x + y)
Step 1: `op(1, 2)` is the first evaluation. Start with 1 and 2, that is, `x` is 1 and `y` is 2.
Step 2: `op(op(1, 2), 3)`. Take the next element, 3: `x` is `op(1, 2) = 3` and `y` is 3.
Step 3: `op(op(op(1, 2), 3), 4)`. Take the next element, 4: `x` is `op(op(1, 2), 3) = op(3, 3) = 6` and `y` is 4.
The result here is the sum of the list elements, 10.
Remark: In general, `reduce` calculates
    op(op(... op(x_1, x_2) ..., x_{n-1}), x_n)
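The nesting above can be made visible with a small trace on a plain Scala `List`. This is a sketch under assumptions: the object name `ReduceTrace` and its members are made up for illustration, and `reduceLeft` is used because it guarantees the left-to-right order described in the steps. On an RDD, Spark's `reduce` expects a commutative and associative operator, and partial results from different partitions may be combined in a different grouping, so the second method mimics one such grouping by hand.

```scala
object ReduceTrace {
  // `op` plays the role of the function passed to reduce; here it is addition.
  // It prints every call so the order of evaluation can be followed.
  def op(x: Int, y: Int): Int = {
    val result = x + y
    println(s"op($x, $y) = $result")
    result
  }

  val numbers: List[Int] = List(1, 2, 3, 4)

  // Left-to-right nesting: op(op(op(1, 2), 3), 4).
  def sumLeftToRight: Int = numbers.reduceLeft((x, y) => op(x, y))

  // A different grouping, written out by hand as if two partitions
  // were reduced separately and then combined.
  def sumInTwoGroups: Int = op(op(1, 2), op(3, 4))

  def main(args: Array[String]): Unit = {
    println(sumLeftToRight)
    println(sumInTwoGroups)
  }
}
```

Both groupings give the same result because addition is associative and commutative. The same holds for the maximum function used in the question.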
Full example
First, `textFile` is an `RDD[String]`, say
    textFile
    "hello Tyth"
    "cool example, eh?"
    "goodbye"
Mapping each line to its word count gives
    textFile.map(line => line.split(" ").size)
    2
    3
    1
and reducing with the maximum gives
    textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
    3
Steps here, recalling `(a, b) => if (a > b) a else b`:
- `op(op(2, 3), 1)` evaluates to `op(3, 1)`, since `op(2, 3) = 3`
- `op(3, 1) = 3`
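The full example can be reproduced without a cluster by letting a plain `List[String]` stand in for the `RDD[String]`. This is a minimal sketch, not the quick start code itself: the object name `LongestLine` and the value names are hypothetical, and the list is an assumption replacing `sc.textFile("README.md")`.

```scala
object LongestLine {
  // Stand-in for the RDD[String]: the three lines from the example above.
  val lines: List[String] = List("hello Tyth", "cool example, eh?", "goodbye")

  // map: turn every line into its number of space-separated words.
  val wordCounts: List[Int] = lines.map(line => line.split(" ").size)

  // reduce: keep the larger of two counts until a single value remains.
  val maxWords: Int = wordCounts.reduce((a, b) => if (a > b) a else b)

  def main(args: Array[String]): Unit = {
    println(wordCounts)
    println(maxWords)
  }
}
```

With a `SparkContext` named `sc` available, as in the quick start guide, the same `map` and `reduce` chain applies unchanged to `sc.textFile("README.md")`.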