map vs mapValues in Spark

Question

I'm currently learning Spark and developing custom machine learning algorithms. My question is what is the difference between .map() and .mapValues() and what are cases where I clearly have to use one instead of the other?

Recommended answer

mapValues is only applicable for PairRDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (tuple of key and value).

In other words, given f: B => C and rdd: RDD[(A, B)], these two are identical (almost - see the note on partitioning at the end):

val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }

val result: RDD[(A, C)] = rdd.mapValues(f)

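The latter is simply shorter and clearer, so when you only want to transform the values and keep the keys as-is, it's recommended to use mapValues.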
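
The equivalence can be checked with a minimal sketch. It assumes Spark is on the classpath and is written as a standalone Scala script running in local mode (in spark-shell, reuse the existing sc instead). The sample data, the function f, and the app name are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumption: local mode with an illustrative app name.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("map-vs-mapValues"))

// Hypothetical sample data of type RDD[(A, B)] with A = String, B = Int.
val rdd: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))

// Hypothetical f: B => C (here Int => Int).
val f: Int => Int = _ * 10

// Same transformation written both ways.
val viaMap: RDD[(String, Int)] = rdd.map { case (k, v) => (k, f(v)) }
val viaMapValues: RDD[(String, Int)] = rdd.mapValues(f)

// Sort after collecting so the comparison does not depend on partition order.
val a = viaMap.collect().sorted
val b = viaMapValues.collect().sorted
println(a.sameElements(b)) // prints true

sc.stop()
```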

On the other hand, if you want to transform the keys too (e.g. you want to apply f: (A, B) => C), you simply can't use mapValues because it would only pass the values to your function.
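
As a rough illustration of that case, the sketch below uses map with a function that needs the key, and with one that produces a new key. It assumes Spark on the classpath and local mode; the data, the function g, and the app name are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumption: local mode with an illustrative app name.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("map-with-keys"))

// Hypothetical sample data.
val rdd: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2)))

// Hypothetical g: (A, B) => C. It reads the key, so mapValues cannot apply it:
// mapValues would only hand the Int value to the function.
val g: ((String, Int)) => String = { case (k, v) => s"$k=$v" }
val labels: RDD[String] = rdd.map(g)

// map can also change the key itself.
val rekeyed: RDD[(String, Int)] = rdd.map { case (k, v) => (k.toUpperCase, v) }

val labelsOut = labels.collect().sorted
val rekeyedOut = rekeyed.collect().sorted
println(labelsOut.mkString(", "))
println(rekeyedOut.mkString(", "))

sc.stop()
```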

The last difference concerns partitioning: if you applied any custom partitioning to your RDD (e.g. using partitionBy), using map would "forget" that partitioner (the resulting RDD has no partitioner set, so it falls back to default partitioning), because the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.
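
The partitioner behaviour can be observed through the RDD's partitioner field. The sketch below assumes Spark on the classpath and local mode; the HashPartitioner with 4 partitions, the data, and the app name are arbitrary choices for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumption: local mode with an illustrative app name.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("partitioner-check"))

// Hypothetical data, explicitly partitioned by key.
val partitioned: RDD[(String, Int)] =
  sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
    .partitionBy(new HashPartitioner(4))

// The same value-only change, written with map and with mapValues.
val viaMap = partitioned.map { case (k, v) => (k, v + 1) }
val viaMapValues = partitioned.mapValues(_ + 1)

// partitioner is an Option[Partitioner]; None means no known partitioning.
val before = partitioned.partitioner
val afterMap = viaMap.partitioner
val afterMapValues = viaMapValues.partitioner

println(before.isDefined)         // prints true
println(afterMap.isDefined)       // prints false
println(afterMapValues.isDefined) // prints true

sc.stop()
```

This matters mostly for later key-based operations such as reduceByKey or join, which can avoid a shuffle when the input already has a suitable partitioner.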
