What is the difference between map and flatMap and a good use case for each?
Question
Can someone explain to me the difference between map and flatMap, and what a good use case for each would be?
What does "flatten the results" mean? What is it good for?
Recommended answer
Here is an example of the difference, as a spark-shell session:
First, some data: two lines of text.
val rdd = sc.parallelize(Seq("Roses are red", "Violets are blue")) // lines
rdd.collect
res0: Array[String] = Array("Roses are red", "Violets are blue")
Now, map transforms an RDD of length N into another RDD of length N.
For example, it maps from two lines into two line-lengths:
rdd.map(_.length).collect
res1: Array[Int] = Array(13, 16)
But flatMap (loosely speaking) transforms an RDD of length N into a collection of N collections, then flattens these into a single RDD of results.
rdd.flatMap(_.split(" ")).collect
res2: Array[String] = Array("Roses", "are", "red", "Violets", "are", "blue")
We have multiple words per line, and multiple lines, but we end up with a single output array of words.
Just to illustrate that, flatMapping from a collection of lines to a collection of words looks like:
["aa bb cc", "", "dd"] => [["aa","bb","cc"],[],["dd"]] => ["aa","bb","cc","dd"]
The input and output RDDs will therefore typically be of different sizes for flatMap.
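The illustration above can be reproduced without Spark. The following is a minimal sketch that uses a plain Scala Seq as a stand-in for the RDD (an assumption made only to keep the example self-contained) and a hypothetical helper named words. The helper drops empty tokens because, on the JVM, "".split(" ") yields a single empty string rather than an empty array, so a bare split would not give the empty inner collection shown in the illustration. It can be pasted into a Scala REPL or spark-shell.

```scala
// Plain Scala collections stand in for the RDD here (assumption for illustration).
val lines = Seq("aa bb cc", "", "dd")

// Hypothetical helper: split on spaces and drop empty tokens,
// so that an empty line contributes no words at all.
def words(line: String): Seq[String] =
  line.split(" ").filter(_.nonEmpty).toSeq

// Step 1: map gives exactly one result per input, so we get a nested structure.
val nested = lines.map(words)

// Step 2: flatten concatenates the inner collections into one.
val flattened = nested.flatten

// flatMap performs both steps in a single call.
val flat = lines.flatMap(words)

println(nested)
println(flat)
println(s"inputs: ${lines.size}, outputs: ${flat.size}")
```

Here three input lines become four output words, which is the size change described above, while the intermediate map result still has exactly three elements.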
If we had tried to use map with our split function, we'd have ended up with nested structures (an RDD of arrays of words, with type RDD[Array[String]]) because we have to have exactly one result per input:
rdd.map(_.split(" ")).collect
res3: Array[Array[String]] = Array(
Array(Roses, are, red),
Array(Violets, are, blue)
)
Finally, one useful special case is mapping with a function which might not return an answer, and so returns an Option. We can use flatMap to filter out the elements that return None and extract the values from those that return a Some:
val rdd = sc.parallelize(Seq(1,2,3,4))
def myfn(x: Int): Option[Int] = if (x <= 2) Some(x * 10) else None
rdd.flatMap(myfn).collect
res4: Array[Int] = Array(10, 20)
(Note that an Option behaves rather like a list that has either one element or zero elements.)
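To make the list analogy concrete, here is a small sketch on plain Scala collections rather than an RDD (an assumption, so that it runs in any Scala REPL). It reuses myfn from the answer; the names xs, mapped, kept and viaLists are introduced only for this illustration.

```scala
// A plain Seq stands in for the RDD (assumption for illustration).
val xs = Seq(1, 2, 3, 4)

// Same function as in the answer: only inputs up to 2 produce a value.
def myfn(x: Int): Option[Int] = if (x <= 2) Some(x * 10) else None

// map keeps one result per input, so the Options stay wrapped.
val mapped = xs.map(myfn)

// flatMap drops each None and unwraps each Some.
val kept = xs.flatMap(myfn)

// Treating each Option as a list of zero or one elements, then flattening,
// gives the same result as flatMap.
val viaLists = xs.map(x => myfn(x).toList).flatten

println(mapped)
println(kept)
```

The mapped result still has four elements, two of them None, whereas the flatMap result contains only the two values that were actually produced.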