Turn RDD into Broadcast Dictionary for lookup
Problem description
What I have so far is:
lookup = sc.textFile("/user/myuser/lookup.asv")
lookup = lookup.map(lambda r: r.split(chr(1)))
And now I have an RDD that looks like:
[
    [filename1, category1],
    [filename2, category2],
    ...
    [filenamen, categoryn]
]
How can I turn that RDD into a broadcasted dictionary like:
{filename1: category1, filename2: category2, ...}
This is what I have tried, but it does not work:
>>> broadcastVar = sc.broadcast({})
>>> data = sc.parallelize([[1,1], [2,2], [3,3], [4,4]])
>>> def myfunc(x):
...     broadcastVar[str(x[0])] = x[1]
...
>>> result = data.map(myfunc)
>>> broadcastVar
<pyspark.broadcast.Broadcast object at 0x7f776555e710>
>>> broadcastVar.value
{}
>>> result.collect()
...
ERROR: TypeError: 'Broadcast' object does not support item assignment
...
>>> broadcastVar.value
{}
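The TypeError above can be reproduced without a cluster: a Broadcast object exposes its payload through .value but defines no __setitem__, so indexing assignment on the wrapper itself fails. Here is a minimal sketch using a toy stand-in class (not pyspark's real Broadcast implementation) to illustrate the mechanism:

```python
# Toy stand-in for pyspark.broadcast.Broadcast (illustration only):
# it wraps a value but, like the real class, defines no __setitem__.
class FakeBroadcast:
    def __init__(self, value):
        self.value = value

bv = FakeBroadcast({})

try:
    bv["filename1"] = "category1"   # mirrors broadcastVar[str(x[0])] = x[1]
except TypeError as e:
    print(e)  # 'FakeBroadcast' object does not support item assignment

print(bv.value)  # {} -- the wrapped dict was never touched
```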
For more information about why I am building this huge lookup variable, read this:
This is a follow-up question to this one.
I have two tables where
table1: a very wide (25K columns and 150K rows) table where each column contains the pixel info and the first column is the filename of the input image file.
table2: a TSV (tab-delimited) file that has 3 million rows; each row contains the image file name and the product category of the image.
Speaking in SQL, I need to do an inner join of those two tables on the filename so I can label the image data for machine learning later on.
It is not realistic to do it in any sort of SQL, because you would have to create a table for table1 with 25K columns, and the CREATE TABLE syntax would be ridiculously long.
Then I am thinking about creating a lookup variable using table2, and maybe making it a broadcast variable, where the key is the filename and the value is the product category.
Solution
Broadcast variables are read-only on the workers. Spark provides accumulators, which are write-only, but those are intended for things like counters. Here you can simply collect and create a Python dictionary:
lookup_bd = sc.broadcast({
    k: v for (k, v) in lookup.map(lambda r: r.split(chr(1))).collect()
})
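The collect-then-broadcast pattern can be exercised without Spark. The sketch below uses made-up sample lines (the filenames and categories are hypothetical); in the real job the lines come from sc.textFile("/user/myuser/lookup.asv"), the resulting dict is wrapped with sc.broadcast(...), and the labeling step runs inside a map over table1's RDD:

```python
# Hypothetical chr(1)-delimited lines, standing in for the .asv file's contents.
raw_lines = [
    "img001.png" + chr(1) + "shoes",
    "img002.png" + chr(1) + "hats",
]

# Same dict comprehension as in the answer, minus the Spark plumbing.
lookup_dict = {k: v for (k, v) in (line.split(chr(1)) for line in raw_lines)}

# Map-side join: label each wide pixel row by looking up its filename.
# Each table1 row is [filename, pixel, pixel, ...] (toy data here).
table1_rows = [
    ["img001.png", 12, 34, 56],
    ["img002.png", 78, 90, 11],
]
labeled = [(r[0], lookup_dict.get(r[0]), r[1:]) for r in table1_rows]
print(labeled[0])  # ('img001.png', 'shoes', [12, 34, 56])
```

On a cluster, replacing the list comprehension with `table1_rdd.map(lambda r: (r[0], lookup_bd.value.get(r[0]), r[1:]))` gives the same join without shuffling the wide table.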
> It is not realistic to do it in any sort of SQL, because you would have to create a table for table1 with 25K columns, and the CREATE TABLE syntax would be ridiculously long.
Creation shouldn't be a problem. As long as you know the names, you can easily create a table like this programmatically:
from random import randint
from pyspark.sql import Row

colnames = ["x{0}".format(i) for i in range(25000)]  # Replace with actual names
row = Row(*colnames)

df = sc.parallelize([
    row(*[randint(0, 100) for _ in range(25000)])
    for x in range(10)
]).toDF()

## len(df.columns)
## 25000
There is another problem here which is much more serious even when you use plain RDDs. Very wide rows are generally speaking hard to handle in any row-wise format.
One thing you can do is to use a sparse representation like SparseVector or SparseMatrix. Another is to encode the pixel info, for example using RLE.
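Both suggestions can be sketched in plain Python, with no Spark involved. The toy pixel row below is made up; a real row would have 25K entries, which is exactly when a sparse {index: value} mapping or run-length encoding pays off:

```python
# Toy pixel row with mostly-zero values (hypothetical sample data).
row = [0, 0, 0, 7, 7, 0, 0, 0, 0, 3]

# Sparse representation: keep only the non-zero entries, keyed by index.
sparse = {i: v for i, v in enumerate(row) if v != 0}
print(sparse)  # {3: 7, 4: 7, 9: 3}

# Run-length encoding: collapse runs of equal values into (value, count) pairs.
def rle_encode(seq):
    out = []
    for v in seq:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

print(rle_encode(row))  # [(0, 3), (7, 2), (0, 4), (3, 1)]
```

In Spark itself, `pyspark.ml.linalg.SparseVector` plays the role of the dict here, storing only the non-zero indices and values of each row.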