Turn RDD into Broadcast Dictionary for lookup


Problem Description


What I have so far is:

lookup = sc.textFile("/user/myuser/lookup.asv")
lookup.map(lambda r: r.split(chr(1)) )

And now I have an RDD that looks like:

[
    [filename1, category1],
    [filename2, category2],
    ...
    [filenamen, categoryn]
]

How can I turn that RDD into a broadcasted dictionary like:

{filename1: category1, filename2: category2, ...}

This is what I have tried, but it is not working:

>>> broadcastVar = sc.broadcast({})
>>> data = sc.parallelize([[1,1], [2,2], [3,3], [4,4]])
>>> def myfunc(x):
...     broadcastVar[str(x[0])] = x[1]
... 
>>> result = data.map(myfunc)
>>> broadcastVar
<pyspark.broadcast.Broadcast object at 0x7f776555e710>
>>> broadcastVar.value
{}
>>> result.collect()
...
ERROR: TypeError: 'Broadcast' object does not support item assignment
...
>>> broadcastVar.value
{}


For more information about why I am building this huge lookup variable, read this:

This is a follow-up question to this one.

I have two tables where

table1: a very wide (25K columns and 150K rows) table where each column contains the pixel info and the first column is the filename of the input image file.

table2: a TSV (tab-delimited) file that has 3 million rows, where each row contains the image file name and the product category of the image.

In SQL terms, I need to do an inner join of those two tables on the filename, so I can label the image data for machine learning later on.

It is not realistic to do this in any sort of SQL, because you would have to create a table for table1, which has 25K columns, and the CREATE TABLE syntax would be ridiculously long.

So I am thinking about creating a lookup variable using table2, and maybe making it a broadcast variable where the key is the filename and the value is the product category.

Solution

Broadcast variables are read-only on the workers. Spark provides accumulators, which are write-only, but these are intended for things like counters. Here you can simply collect and create a Python dictionary:

lookup_bd = sc.broadcast({
  k: v for (k, v) in lookup.map(lambda r: r.split(chr(1))).collect()
})
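
As a hypothetical usage sketch (the RDD name table1_rdd and its row layout are assumptions, not part of the original post), the broadcast dictionary would then be read on the workers through its .value attribute inside a transformation:

# Hypothetical sketch: label each pixel row with its product category.
# `table1_rdd` is an assumed RDD whose first field is the image filename.
labeled = table1_rdd.map(
    lambda row: (lookup_bd.value.get(row[0]), row[1:])
)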

It is not realistic to do this in any sort of SQL, because you would have to create a table for table1, which has 25K columns, and the CREATE TABLE syntax would be ridiculously long.

Creation shouldn't be a problem. As long as you know the column names, you can easily create a table like this programmatically:

from pyspark.sql import Row
from random import randint

colnames = ["x{0}".format(i) for i in range(25000)]  # Replace with actual names
row = Row(*colnames)  # Row class with one field per column

df = sc.parallelize([
    row(*[randint(0, 100) for _ in range(25000)]) for x in range(10)
]).toDF()

## len(df.columns)
## 25000

There is another problem here, which is much more serious even when you use plain RDDs: very wide rows are, generally speaking, hard to handle in any row-wise format.

One thing you can do is use a sparse representation like SparseVector or SparseMatrix. Another is to encode the pixel info, for example using RLE.
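
As a minimal sketch of the sparse idea (the size, indices, and values below are made-up examples, and this only pays off if most pixel values are zero):

from pyspark.mllib.linalg import SparseVector

# Hypothetical example: a 25K-element pixel row with only three
# non-zero pixels, keyed by the image filename.
sparse_row = ("filename1", SparseVector(25000, [0, 17, 24999], [12.0, 255.0, 3.0]))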
