Broadcast Annoy object in Spark (for nearest neighbors)?


Problem description


As Spark's MLlib doesn't have nearest-neighbors functionality, I'm trying to use Annoy for approximate nearest neighbors. I broadcast the Annoy object and pass it to the workers; however, it does not work as expected.

Below is code to reproduce the issue (to be run in PySpark). The problem shows up in the difference between the results Annoy returns with and without Spark.

from annoy import AnnoyIndex
import random
random.seed(42)

f = 40
t = AnnoyIndex(f)  # Length of item vector that will be indexed
allvectors = []
for i in xrange(20):
    v = [random.gauss(0, 1) for z in xrange(f)]
    t.add_item(i, v)
    allvectors.append((i, v))
t.build(10) # 10 trees

# Use Annoy with Spark
sparkvectors = sc.parallelize(allvectors)
bct = sc.broadcast(t)
x = sparkvectors.map(lambda x: bct.value.get_nns_by_vector(vector=x[1], n=5))
print "Five closest neighbors for first vector with Spark:",
print x.first()

# Use Annoy without Spark
print "Five closest neighbors for first vector without Spark:",
print(t.get_nns_by_vector(vector=allvectors[0][1], n=5))

Output seen:

Five closest neighbors for first vector with Spark: None

Five closest neighbors for first vector without Spark: [0, 13, 12, 6, 4]

Solution

I've never used Annoy but I am pretty sure that the package description explains what is going on here:

It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.

Since it uses memory-mapped indexes, all of that data is lost on the way when you serialize the object and pass it to the workers.
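A minimal local sketch of the same failure, outside Spark, reusing t and allvectors from the code above (illustration only, and it assumes the Annoy version from the question: depending on the version, pickling may either fail outright or silently drop the index data):

import pickle

try:
    # Spark ships broadcast variables through pickle; round-trip the index the same way.
    t_copy = pickle.loads(pickle.dumps(t))
    # The copy no longer has the mmapped index behind it, so the query is useless.
    print t_copy.get_nns_by_vector(allvectors[0][1], 5)
except Exception as e:
    # Some Annoy versions refuse to pickle the index at all.
    print "Pickling the index failed:", e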

Try something like this instead:

from pyspark import SparkFiles

# Save the built index to disk and ship the file to every executor.
t.save("index.ann")
sc.addPyFile("index.ann")

def find_neighbors(iter):
    # Reload (mmap) the index from the local copy of the file on each partition.
    t = AnnoyIndex(f)
    t.load(SparkFiles.get("index.ann"))
    return (t.get_nns_by_vector(vector=x[1], n=5) for x in iter)

sparkvectors.mapPartitions(find_neighbors).first()
## [0, 13, 12, 6, 4]
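If you also want to keep each query's id next to its neighbors, a small variation of the same pattern should work. This is a sketch under the same assumptions as above (the f, sparkvectors, and index.ann names from the answer); find_neighbors_with_ids is a hypothetical helper, not part of the original answer:

from pyspark import SparkFiles
from annoy import AnnoyIndex

def find_neighbors_with_ids(iter):
    # Load the shipped index once per partition, then pair each id with its neighbors.
    t = AnnoyIndex(f)
    t.load(SparkFiles.get("index.ann"))
    return ((i, t.get_nns_by_vector(v, n=5)) for i, v in iter)

print sparkvectors.mapPartitions(find_neighbors_with_ids).take(3)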
