Multiply SparseVectors element-wise
Question
I have 2 RDDs and I want to multiply these two RDDs element-wise.
Let's say I have the following RDDs (example):
a = ((1,[0.28,1,0.55]),(2,[0.28,1,0.55]),(3,[0.28,1,0.55]))
aRDD = sc.parallelize(a)
b = ((1,[0.28,0,0]),(2,[0,0,0]),(3,[0,1,0]))
bRDD = sc.parallelize(b)
As can be seen, b is sparse, and I want to avoid multiplying a zero value with another value. I am doing the following:
from pyspark.mllib.linalg import Vectors

def create_sparce_matrix(a_list):
    length = len(a_list)
    index = [i for i, e in enumerate(a_list) if e != 0]
    value = [e for i, e in enumerate(a_list) if e != 0]
    sv1 = Vectors.sparse(length, index, value)
    return sv1

brdd = bRDD.map(lambda kv: (kv[0], create_sparce_matrix(kv[1])))
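For readers without a Spark shell handy, the conversion logic can be checked in plain Python. The `to_sparse` helper below is a hypothetical stand-in, not part of pyspark; it just returns the same `(size, indices, values)` triple that `Vectors.sparse` is constructed from:

```python
def to_sparse(a_list):
    """Extract the sparse representation of a dense list:
    its length, the indices of non-zero entries, and their values."""
    length = len(a_list)
    index = [i for i, e in enumerate(a_list) if e != 0]
    value = [e for i, e in enumerate(a_list) if e != 0]
    return length, index, value

print(to_sparse([0.28, 0, 0]))  # (3, [0], [0.28])
print(to_sparse([0, 1, 0]))    # (3, [1], [1])
```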
and the multiplication:
combinedRDD = aRDD.union(brdd)
result = combinedRDD.reduceByKey(lambda a, b: [c * d for c, d in zip(a, b)])
It seems that I can't multiply a sparse vector with a list in an RDD. Is there a way to do it? Or is there another efficient way to multiply element-wise when one of the two RDDs has a lot of zero values?
Answer
One way you can handle this is to convert aRDD to RDD[DenseVector]:
from pyspark.mllib.linalg import SparseVector, DenseVector, Vectors
aRDD = sc.parallelize(a).mapValues(DenseVector)
bRDD = sc.parallelize(b).mapValues(create_sparce_matrix)
and use basic NumPy operations:
def mul(x, y):
    assert isinstance(x, DenseVector)
    assert isinstance(y, SparseVector)
    assert x.size == y.size
    # Only the non-zero positions of y contribute to the product,
    # so work is proportional to the number of non-zeros.
    return SparseVector(y.size, y.indices, x[y.indices] * y.values)

aRDD.join(bRDD).mapValues(lambda xy: mul(*xy))
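The core of `mul` is just indexing the dense vector at the sparse vector's non-zero positions. A minimal pyspark-free sketch with NumPy shows the same idea; the `sparse_mul` name is ours for illustration, not part of any API:

```python
import numpy as np

def sparse_mul(dense, indices, values):
    """Multiply a dense vector by a sparse vector given as
    parallel (indices, values) arrays. Only the non-zero
    positions of the sparse vector are touched."""
    dense = np.asarray(dense)
    indices = np.asarray(indices, dtype=int)
    values = np.asarray(values)
    # Result stays sparse: same indices, element-wise products as values.
    return indices, dense[indices] * values

# Mirroring the question's data: a row of aRDD times a sparse row of bRDD.
idx, vals = sparse_mul([0.28, 1.0, 0.55], [0, 2], [2.0, 3.0])
print(idx.tolist(), vals.tolist())
```

This keeps the output sparse as well, which is what `SparseVector(y.size, y.indices, ...)` does in the answer above.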