Distributed Map in Scala Spark


Question

Does Spark support distributed Map collection types?

So if I have a HashMap[String,String] of key/value pairs, can this be converted to a distributed Map collection type? To access an element I could use filter, but I doubt that performs as well as a Map lookup.
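For illustration, here is a minimal sketch of what the question describes, using only the standard RDD API (the object name, local master setting, and sample data are made up for this example): a local HashMap is distributed as an RDD of pairs, and a single key is then retrieved with filter, which scans the data rather than doing a constant-time hash lookup.

import scala.collection.immutable.HashMap
import org.apache.spark.{SparkConf, SparkContext}

object LocalMapToRdd {
  def main(args: Array[String]): Unit = {
    // Local setup for illustration only.
    val sc = new SparkContext(new SparkConf().setAppName("map-to-rdd").setMaster("local[*]"))

    // A plain in-memory map of key/value pairs.
    val localMap = HashMap("a" -> "1", "b" -> "2", "c" -> "3")

    // Distribute it as an RDD of (key, value) pairs.
    val pairs = sc.parallelize(localMap.toSeq)

    // Retrieving one key with filter scans every partition,
    // unlike a hash map's constant-time access.
    val viaFilter = pairs.filter { case (k, _) => k == "b" }.collect()
    println(viaFilter.mkString(", "))   // => (b,2)

    sc.stop()
  }
}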

Answer

Since I found some new info, I thought I'd turn my comments into an answer. @maasg already covered the standard lookup function; I would like to point out that you should be careful, because if the RDD's partitioner is None, lookup just falls back to a filter anyway (a short sketch of that lookup path follows the example below). As for a (K,V) store on top of Spark, it looks like this is in progress, and a usable pull request has already been made here. Here is an example usage:

import org.apache.spark.rdd.IndexedRDD
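// The example assumes an existing SparkContext named `sc`, e.g. the one
// provided by spark-shell.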

// Create an RDD of key-value pairs with Long keys.
val rdd = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
// Construct an IndexedRDD from the pairs, hash-partitioning and indexing
// the entries.
val indexed = IndexedRDD(rdd).cache()

// Perform a point update.
val indexed2 = indexed.put(1234L, 10873).cache()
// Perform a point lookup. Note that the original IndexedRDD remains
// unmodified.
indexed2.get(1234L) // => Some(10873)
indexed.get(1234L) // => Some(0)

// Efficiently join derived IndexedRDD with original.
val indexed3 = indexed.innerJoin(indexed2) { (id, a, b) => b }.filter(_._2 != 0)
indexed3.collect // => Array((1234L, 10873))

// Perform insertions and deletions.
val indexed4 = indexed2.put(-100L, 111).delete(Array(998L, 999L)).cache()
indexed2.get(-100L) // => None
indexed4.get(-100L) // => Some(111)
indexed2.get(999L) // => Some(0)
indexed4.get(999L) // => None
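
For comparison, here is a minimal sketch of the plain lookup path mentioned above, using only the standard RDD API (the object name, local master setting, and data are just for this example). Without a partitioner, lookup falls back to a full filter scan; once the RDD has a known partitioner, lookup only needs to run a job on the single partition that can contain the key.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
// Needed for PairRDDFunctions (lookup, partitionBy) on older Spark versions.
import org.apache.spark.SparkContext._

object LookupSketch {
  def main(args: Array[String]): Unit = {
    // Local setup for illustration only.
    val sc = new SparkContext(new SparkConf().setAppName("lookup-sketch").setMaster("local[*]"))

    val pairs = sc.parallelize((1 to 1000).map(i => (i, s"value-$i")))

    // No partitioner yet: lookup scans all partitions with a filter.
    println(pairs.lookup(42))          // => Seq(value-42)

    // With a HashPartitioner in place, lookup only visits the one
    // partition that can hold the key.
    val partitioned = pairs.partitionBy(new HashPartitioner(8)).cache()
    println(partitioned.lookup(42))    // => Seq(value-42)

    sc.stop()
  }
}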

It seems like the pull request was well received and will probably be included in a future version of Spark, so it is probably safe to use the code from that pull request in your own project. Here is the JIRA ticket in case you were curious.

