Creating a large dictionary in pyspark


Problem description

I am trying to solve the following problem using pyspark. I have a file on HDFS which is a dump of a lookup table, in this format:

key1, value1
key2, value2
...

I want to load this into a Python dictionary in pyspark and use it for some other purpose, so I tried:

table = {}

def populateDict(line):
    # attempt to fill the driver-side dict from each line
    (k, v) = line.split(",", 1)
    table[k] = v

kvfile = sc.textFile("pathtofile")
kvfile.foreach(populateDict)

I found that the table variable is not modified. So, is there a way to create a large in-memory hashtable in Spark?

Answer

foreach is a distributed computation, so you can't expect it to modify a data structure that is only visible in the driver. What you want is:

kv.map(line => line.split(",", 2) match {   // split on the comma from the question's format
    case Array(k, v) => (k, v)
    case _ => ("", "")
}).collectAsMap()

This is Scala, but you get the idea; the important function is collectAsMap(), which returns a map to the driver.
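Since the question asks for pyspark, here is a minimal sketch of the same approach in Python. It is a translation of the Scala above, not part of the original answer, and it assumes the comma-separated format and the placeholder path from the question; collectAsMap() also exists on pyspark RDDs.

# Build a driver-side dict in pyspark with collectAsMap()
kvfile = sc.textFile("pathtofile")
table = (kvfile
         .map(lambda line: line.split(",", 1))      # split into [key, value]
         .filter(lambda kv: len(kv) == 2)           # drop malformed lines
         .map(lambda kv: (kv[0], kv[1].strip()))    # build (key, value) tuples
         .collectAsMap())                           # returns a plain dict to the driver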

If your data is very large, you can use a PairRDD as a map. First, map to pairs:

    kv.map(line => line.split(",", 2) match {
        case Array(k, v) => (k, v)
        case _ => ("", "")
    })

Then you can access it with rdd.lookup("key"), which returns a sequence of the values associated with that key, though this will definitely not be as efficient as other distributed KV stores; Spark isn't really built for that.
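The pair-RDD lookup is available in pyspark as well. A sketch under the same assumptions as above; cache() is added here because each lookup() call launches a Spark job, so re-reading the file every time would be wasteful:

# Keep the data as a pair RDD and query it with lookup()
pairs = sc.textFile("pathtofile") \
          .map(lambda line: line.split(",", 1)) \
          .filter(lambda kv: len(kv) == 2) \
          .map(lambda kv: (kv[0], kv[1].strip())) \
          .cache()                          # avoid re-reading the file on every lookup
print(pairs.lookup("key1"))                 # list of values associated with "key1"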
