What is the difference between map and flatMap, and what is a good use case for each?

Question

Can someone explain to me the difference between map and flatMap, and what is a good use case for each?

What does "flatten the results" mean? What is it good for?

Recommended answer

Here is an example of the difference, as a spark-shell session:

First, some data: two lines of text.

val rdd = sc.parallelize(Seq("Roses are red", "Violets are blue"))  // lines

rdd.collect

    res0: Array[String] = Array("Roses are red", "Violets are blue")

Now, map transforms an RDD of length N into another RDD of length N.

For example, it maps two lines to two line lengths:

rdd.map(_.length).collect

    res1: Array[Int] = Array(13, 16)
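
The same length-preserving behaviour can be checked without a Spark cluster. This is a minimal sketch, assuming a plain Scala Seq as a stand-in for the RDD; the names lines, lengths and sameSize are illustrative and not part of the original session.

```scala
// Local stand-in for the RDD of lines (assumption: a Seq instead of an RDD).
val lines: Seq[String] = Seq("Roses are red", "Violets are blue")

// map applies the function once per element: one output per input.
val lengths: Seq[Int] = lines.map(_.length)          // Seq(13, 16)

// The output always has the same number of elements as the input.
val sameSize: Boolean = lengths.size == lines.size   // true
```

Whatever function is passed to map, the element count cannot change; only the element type and values do.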

But flatMap (loosely speaking) transforms an RDD of length N into a collection of N collections, then flattens these into a single RDD of results.

rdd.flatMap(_.split(" ")).collect

    res2: Array[String] = Array("Roses", "are", "red", "Violets", "are", "blue")

We have multiple words per line, and multiple lines, but we end up with a single output array of words.

Just to illustrate that, flatMapping from a collection of lines to a collection of words looks like:

["aa bb cc", "", "dd"] => [["aa","bb","cc"],[],["dd"]] => ["aa","bb","cc","dd"]

The input and output RDDs will therefore typically be of different sizes for flatMap.
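
The two-step picture above (map to nested collections, then flatten) can be written out explicitly. This is a sketch on plain Scala collections rather than an RDD, and the helper words is hypothetical. One assumption is worth flagging: on the JVM, "".split(" ") returns an array holding a single empty string, not an empty array, so the helper drops empty tokens to match the illustration.

```scala
// Hypothetical helper: split on single spaces and drop empty tokens,
// so that an empty line contributes no words at all.
def words(line: String): Seq[String] =
  line.split(" ").toSeq.filter(_.nonEmpty)

val lines: Seq[String] = Seq("aa bb cc", "", "dd")

// Step 1: map gives one collection per input line (a nested result).
val nested: Seq[Seq[String]] = lines.map(words)
// Seq(Seq("aa", "bb", "cc"), Seq(), Seq("dd"))

// Step 2: flatten concatenates the inner collections.
val flattened: Seq[String] = nested.flatten
// Seq("aa", "bb", "cc", "dd")

// flatMap performs both steps in a single call.
val viaFlatMap: Seq[String] = lines.flatMap(words)
```

Three input lines become four output words here, which is the size change the paragraph above describes.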

If we had tried to use map with our split function, we'd have ended up with nested structures (an RDD of arrays of words, with type RDD[Array[String]]) because map must produce exactly one result per input:

rdd.map(_.split(" ")).collect

    res3: Array[Array[String]] = Array(
                                     Array(Roses, are, red), 
                                     Array(Violets, are, blue)
                                 )
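
The nested result is not always a mistake: it is what you want when the per-line grouping matters. The following sketch, again on a plain Seq rather than an RDD and with illustrative names, suggests one such use of map with split.

```scala
val lines: Seq[String] = Seq("Roses are red", "Violets are blue")

// map keeps one result per line, so each line's words stay grouped together.
val wordsPerLine: Seq[Array[String]] = lines.map(_.split(" "))

// Because the grouping survives, per-line questions are easy to answer.
val wordCounts: Seq[Int] = wordsPerLine.map(_.length)    // Seq(3, 3)
val firstWords: Seq[String] = wordsPerLine.map(_.head)   // Seq("Roses", "Violets")
```

With flatMap the line boundaries would be lost, so neither the per-line word count nor the first word of each line could be recovered from the output.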

Finally, one useful special case is mapping with a function that might not return an answer, and so returns an Option. We can use flatMap to filter out the elements that return None and extract the values from those that return a Some:

val rdd = sc.parallelize(Seq(1,2,3,4))

def myfn(x: Int): Option[Int] = if (x <= 2) Some(x * 10) else None

rdd.flatMap(myfn).collect

    res4: Array[Int] = Array(10, 20)

(Note that an Option behaves rather like a list that has either one element or zero elements.)
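
A common instance of this pattern is parsing input that may be malformed. The sketch below assumes plain Scala collections rather than an RDD; parseInt is a hypothetical helper, not something from the original answer.

```scala
import scala.util.Try

// Hypothetical helper: Some(n) when the text is an integer, None otherwise.
def parseInt(s: String): Option[Int] = Try(s.trim.toInt).toOption

val raw: Seq[String] = Seq("10", "oops", "20", "")

// flatMap drops the None results and unwraps the Some results.
val parsed: Seq[Int] = raw.flatMap(s => parseInt(s))   // Seq(10, 20)

// An Option converts to a list of zero or one elements.
val one: List[Int] = Some(10).toList                   // List(10)
val none: List[Int] = Option.empty[Int].toList         // List()
```

Using map here instead would give a Seq[Option[Int]] that still contains the None entries, which then have to be removed in a separate step.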
