How do `map` and `reduce` methods work in Spark RDDs?


Problem description

The following code is from the Apache Spark quick start guide. Can somebody explain to me what the "line" variable is and where it comes from?

textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)

Also, how do values get passed into a and b?

Link to the quick start guide: http://spark.apache.org/docs/latest/quick-start.html

Solution

First, according to your link, textFile is created as

val textFile = sc.textFile("README.md")

so textFile is an RDD[String], that is, a resilient distributed dataset whose elements are of type String. Its API is very similar to that of regular Scala collections.

So now what does this map do?

Imagine you have a list of Strings and want to convert that into a list of Ints, representing the length of each String.

val stringlist: List[String] = List("ab", "cde", "f")
val intlist: List[Int] = stringlist.map( x => x.length )

The map method expects a function, here one that goes from String => Int. That function is applied to each element of the list, so the value of intlist is List(2, 3, 1).

Here, we have created an anonymous function from String => Int, namely x => x.length. One can write the function even more explicitly as

stringlist.map( (x: String) => x.length )  

If you prefer the explicit form, you can also give the function a name and a declared type:

val stringLength : (String => Int) = {
  x => x.length
}
val intlist = stringlist.map( stringLength )

Here it is evident that stringLength is a function from String to Int.

Remark: In general, map is what makes up a so-called Functor. You provide a function from A => B, and the map of the functor (here List) lets you use that function to go from List[A] => List[B] as well. This is called lifting.
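The idea of lifting can be made concrete with a small sketch. The helper name lift and the object LiftExample are hypothetical and not part of the Scala library or Spark; the sketch only uses List.map from the standard library.

```scala
object LiftExample {
  // Hypothetical helper: turns a function on elements (A => B)
  // into a function on whole lists (List[A] => List[B]) by using map.
  def lift[A, B](f: A => B): List[A] => List[B] =
    (xs: List[A]) => xs.map(f)

  val stringLength: String => Int = x => x.length

  // The lifted function now works on lists instead of single strings.
  val lengths: List[String] => List[Int] = lift(stringLength)

  def main(args: Array[String]): Unit =
    println(lengths(List("ab", "cde", "f"))) // List(2, 3, 1)
}
```

Calling stringlist.map(stringLength) does the same thing in one step; lift just separates "build the list function" from "apply it".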

Answers to your questions

What is the "line" variable?

As mentioned above, line is the input parameter of the function line => line.split(" ").size

More explicitly: (line: String) => line.split(" ").size

Example: If line is "hello world", the function returns 2.

"hello world" 
=> Array("hello", "world")  // split 
=> 2                        // size of Array
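To try this outside of Spark, the function can be given a name and applied to single strings. This is a minimal sketch; the names LineSize and wordCount are made up for illustration.

```scala
object LineSize {
  // The same function as in the quick start snippet, given a (hypothetical) name.
  val wordCount: String => Int = line => line.split(" ").size

  def main(args: Array[String]): Unit = {
    println(wordCount("hello world"))       // 2
    println(wordCount("cool example, eh?")) // 3
    // Two consecutive spaces produce an empty string between them.
    println(wordCount("hello  world"))      // 3
  }
}
```

The last call shows that splitting on a single space counts the empty string between two consecutive spaces, so the result is a count of split pieces rather than strictly of words.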

How do values get passed into a and b?

reduce also expects a function, this time from (A, A) => A, where A is the element type of your RDD. Let's call this function op.

What does reduce do? Example:

List( 1, 2, 3, 4 ).reduce( (x,y) => x + y )
Step 1:  op( 1, 2 ) is the first evaluation.
  Start with the first two elements, 1 and 2:
    x is 1   and y is 2
Step 2:  op( op( 1, 2 ), 3 )
  Take the next element, 3:
    x is op( 1, 2 ) = 3   and y is 3
Step 3:  op( op( op( 1, 2 ), 3 ), 4 )
  Take the next element, 4:
    x is op( op( 1, 2 ), 3 ) = op( 3, 3 ) = 6   and y is 4

The result here is the sum of the list elements, 10.
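The steps can be reproduced with a small traced version of a left-to-right reduce. This is only a sketch of the sequential behaviour on a List; tracedReduce and ReduceTrace are hypothetical names, and the helper is built on foldLeft rather than on the library's own reduce.

```scala
object ReduceTrace {
  // Hypothetical helper: a left-to-right reduce that records every call to op.
  def tracedReduce(xs: List[Int])(op: (Int, Int) => Int): (Int, List[String]) = {
    require(xs.nonEmpty, "reduce needs at least one element")
    // Start with the first element as x, then feed in the rest one by one as y.
    xs.tail.foldLeft((xs.head, List.empty[String])) {
      case ((acc, log), y) =>
        val result = op(acc, y)
        (result, log :+ s"op($acc, $y) = $result")
    }
  }

  def main(args: Array[String]): Unit = {
    val (sum, steps) = tracedReduce(List(1, 2, 3, 4))(_ + _)
    steps.foreach(println) // op(1, 2) = 3, then op(3, 3) = 6, then op(6, 4) = 10
    println(sum)           // 10
  }
}
```

Each logged line shows which values ended up in x and y for one call: x is always the result so far, y is the next element.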

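Remark: For a sequential collection such as List, reduce in general calculates

op( op( ... op(x_1, x_2) ..., x_{n-1}), x_n)

On an RDD the elements are spread over partitions and this left-to-right order is not guaranteed, so op should be commutative and associative.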
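Spark's documentation asks that the function passed to reduce be commutative and associative so that it can be computed correctly in parallel. The following sketch imitates that with plain Scala collections: each chunk stands in for a partition, and the partial results are reduced again. It is an illustration under that assumption, not how Spark is implemented; reduceInChunks, larger and PartitionedReduce are hypothetical names.

```scala
object PartitionedReduce {
  // Reduce each chunk (a stand-in for a partition), then reduce the partial results.
  // Assumes xs is non-empty and chunkSize is positive.
  def reduceInChunks[A](xs: List[A], chunkSize: Int)(op: (A, A) => A): A =
    xs.grouped(chunkSize).map(chunk => chunk.reduceLeft(op)).reduceLeft(op)

  // The op from the quick start snippet: keep the larger of two values.
  val larger: (Int, Int) => Int = (a, b) => if (a > b) a else b

  def main(args: Array[String]): Unit = {
    val sizes = List(2, 3, 1, 5, 4)
    // Taking the larger value is associative and commutative: grouping does not matter.
    println(sizes.reduceLeft(larger))        // 5
    println(reduceInChunks(sizes, 2)(larger)) // 5
    // Subtraction is neither, so grouping changes the result.
    println(sizes.reduceLeft(_ - _))         // (((2 - 3) - 1) - 5) - 4 = -11
    println(reduceInChunks(sizes, 2)(_ - _)) // ((2 - 3) - (1 - 5)) - 4 = -1
  }
}
```

The op from the question, (a, b) => if (a > b) a else b, is safe in this sense, which is why the result does not depend on how the lines are distributed.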

Full example

First, textFile is an RDD[String], say

textFile
 "hello Tyth"
 "cool example, eh?"
 "goodbye"

textFile.map(line => line.split(" ").size)
 2
 3
 1
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
 3
   Steps here; recall that op is `(a, b) => if (a > b) a else b`
   - op( op(2, 3), 1) evaluates to op(3, 1), since op(2, 3) = 3
   - op( 3, 1 ) = 3
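The full example can be checked without a Spark installation by letting a plain List stand in for the RDD[String]. This is a sketch under that assumption; the names FullExample, lines, sizes and longest are made up, and with Spark the first value would instead come from sc.textFile.

```scala
object FullExample {
  // A plain List stands in for the RDD[String]; no Spark is needed for this sketch.
  val lines: List[String] = List("hello Tyth", "cool example, eh?", "goodbye")

  // map: each line is passed in as `line` and replaced by its number of pieces.
  val sizes: List[Int] = lines.map(line => line.split(" ").size)

  // reduce: pairs of values are passed in as `a` and `b`; the larger one survives.
  val longest: Int = sizes.reduce((a, b) => if (a > b) a else b)

  def main(args: Array[String]): Unit = {
    println(sizes)   // List(2, 3, 1)
    println(longest) // 3
  }
}
```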
