Creating a large dictionary in pyspark


Problem description

I am trying to solve the following problem using pyspark. I have a file on HDFS which is a dump of a lookup table, in this format:

key1, value1
key2, value2
...

I want to load this into a Python dictionary in pyspark and use it for some other purpose, so I tried:

table = {}

def populateDict(line):
    # attempt to fill the driver-side dict from each line
    (k, v) = line.split(",", 1)
    table[k] = v

kvfile = sc.textFile("pathtofile")
kvfile.foreach(populateDict)

I found that the table variable is not modified. So, is there a way to create a large in-memory hashtable in Spark?

Answer

foreach is a distributed computation, so you can't expect it to modify a data structure that is only visible in the driver. What you want is:

kv.map(line => line.split(",", 2) match {   // split on the comma from the question's format
    case Array(k, v) => (k, v)
    case _ => ("", "")
}).collectAsMap()

This is Scala, but you get the idea; the important function is collectAsMap(), which returns a map to the driver.
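Since the question asks for pyspark, here is a minimal sketch of the same approach in Python. It is a translation of the Scala above, not part of the original answer, and it assumes the comma-separated format and the placeholder path from the question; collectAsMap() also exists on pyspark RDDs.

# Build a driver-side dict in pyspark with collectAsMap()
kvfile = sc.textFile("pathtofile")
table = (kvfile
         .map(lambda line: line.split(",", 1))      # split into [key, value]
         .filter(lambda kv: len(kv) == 2)           # drop malformed lines
         .map(lambda kv: (kv[0], kv[1].strip()))    # build (key, value) tuples
         .collectAsMap())                           # returns a plain dict to the driver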

If your data is very large, you can use a PairRDD as a map. First, map to pairs:

    kv.map(line => line.split(",", 2) match {
        case Array(k, v) => (k, v)
        case _ => ("", "")
    })

Then you can access it with rdd.lookup("key"), which returns a sequence of the values associated with that key, though this will definitely not be as efficient as other distributed KV stores; Spark isn't really built for that.
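The pair-RDD lookup is available in pyspark as well. A sketch under the same assumptions as above; cache() is added here because each lookup() call launches a Spark job, so re-reading the file every time would be wasteful:

# Keep the data as a pair RDD and query it with lookup()
pairs = sc.textFile("pathtofile") \
          .map(lambda line: line.split(",", 1)) \
          .filter(lambda kv: len(kv) == 2) \
          .map(lambda kv: (kv[0], kv[1].strip())) \
          .cache()                          # avoid re-reading the file on every lookup
print(pairs.lookup("key1"))                 # list of values associated with "key1"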
