map vs mapValues in Spark
Question
I'm currently learning Spark and developing custom machine learning algorithms. My question is: what is the difference between .map() and .mapValues(), and what are the cases where I clearly have to use one instead of the other?
Recommended answer
mapValues is only applicable for PairRDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (tuple of key and value).
In other words, given f: B => C and rdd: RDD[(A, B)], these two are identical (almost - see comment at the bottom):
val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }
val result: RDD[(A, C)] = rdd.mapValues(f)
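The latter is simply shorter and clearer, so when you just want to transform the values and keep the keys as-is, it's recommended to use mapValues.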
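To make the equivalence concrete, here is a minimal sketch that runs both forms on a small pair RDD and compares the results. It assumes Spark (spark-core) is on the classpath and that the code runs as a Scala script or is pasted into spark-shell; the local[2] master, the sample data, and the names prices and twice are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode, for experimenting only.
val conf = new SparkConf().setAppName("map-vs-mapValues").setMaster("local[2]")
// getOrCreate reuses an existing context (e.g. the one spark-shell provides).
val sc = SparkContext.getOrCreate(conf)

// Hypothetical sample data of type RDD[(String, Int)].
val prices = sc.parallelize(Seq(("apple", 10), ("pear", 20), ("plum", 30)))

// f: B => C, here Int => Int.
val twice: Int => Int = _ * 2

// Form 1: map over the whole (key, value) tuple, rebuilding the key by hand.
val viaMap = prices.map { case (k, v) => (k, twice(v)) }.collect().sorted

// Form 2: mapValues passes only the value to the function.
val viaMapValues = prices.mapValues(twice).collect().sorted

println(viaMap.mkString(", "))
println(viaMapValues.mkString(", "))

// Skip this line in spark-shell, where it would stop the shell's own context.
sc.stop()
```

Both collected arrays contain the same key/value pairs; the only difference at this point is how much code you write.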
On the other hand, if you want to transform the keys too (e.g. you want to apply f: (A, B) => C), you simply can't use mapValues because it would only pass the values to your function.
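A short sketch of that case, under the same assumptions as any local experiment (spark-core on the classpath, run as a Scala script or in spark-shell). The data and the names scores, label, and byScore are hypothetical; the point is only that a function needing the key has to go through map.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode, for experimenting only.
val conf = new SparkConf().setAppName("map-with-keys").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical sample data of type RDD[(String, Int)].
val scores = sc.parallelize(Seq(("alice", 3), ("bob", 5)))

// f: (A, B) => C needs both the key and the value, so it must be used with map.
val label: ((String, Int)) => String = { case (name, score) => s"$name:$score" }
val labels = scores.map(label).collect().sorted

// Changing the key itself (here: swapping key and value) also requires map.
val byScore = scores.map { case (name, score) => (score, name) }.collect().sorted

// mapValues only ever sees the value; the key is not available inside the function.
val bumped = scores.mapValues(score => score + 1).collect().sorted

// Skip this line in spark-shell, where it would stop the shell's own context.
sc.stop()
```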
The last difference concerns partitioning: if you applied any custom partitioning to your RDD (e.g. using partitionBy), using map would "forget" that partitioner (the result will revert to default partitioning) as the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.
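The partitioner difference can be observed directly through the RDD's partitioner field, which is an Option[Partitioner]. The following is a minimal sketch, assuming spark-core on the classpath and a local master; the choice of HashPartitioner with 4 partitions and the sample data are arbitrary assumptions for the demonstration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Assumption: local mode, for experimenting only.
val conf = new SparkConf().setAppName("partitioner-demo").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical pair RDD with an explicit partitioner attached.
val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
  .partitionBy(new HashPartitioner(4))

// Same value transformation, written both ways.
val afterMap = pairs.map { case (k, v) => (k, v.toUpperCase) }
val afterMapValues = pairs.mapValues(_.toUpperCase)

// partitioner is metadata on the RDD; reading it does not trigger a job.
val partitionerAfterMap = afterMap.partitioner
val partitionerAfterMapValues = afterMapValues.partitioner

println(s"after map:       $partitionerAfterMap")
println(s"after mapValues: $partitionerAfterMapValues")

// Skip this line in spark-shell, where it would stop the shell's own context.
sc.stop()
```

Note that map does not move any data here; it only drops the partitioner metadata, because Spark cannot know whether the keys changed. The practical cost shows up later: a key-based operation such as reduceByKey or join on the mapped RDD may shuffle again, whereas the mapValues result can reuse the existing partitioning.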