Spark cosine distance between rows using Dataframe
Problem description
I have to compute the cosine distance between every pair of rows, but I have no idea how to do it elegantly with the Spark DataFrame API. The idea is to compute the similarities for each row (item) and take the top 10 by comparing the similarities between rows. This is needed for an item-item recommender system.
Everything I have read so far deals with computing similarity over columns (for example, "Apache Spark Python Cosine Similarity over DataFrames"). Is it possible to compute the cosine distance between rows elegantly using PySpark's DataFrame API or RDDs, or do I have to do it manually?
Here is some code to show what I intend to do:
import numpy as np
from numpy import linalg as LA

def cosineSimilarity(vec1, vec2):
    # Dot product divided by the product of the two vector norms
    return vec1.dot(vec2) / (LA.norm(vec1) * LA.norm(vec2))

# p.s. model is ALS
Pred_Factors = model.itemFactors.cache()  # Pred_Factors = DataFrame[id: int, features: array<float>]
sims = []
for _id, _feature in Pred_Factors.toLocalIterator():
    itemFactor = np.asarray(_feature)
    for id, feature in Pred_Factors.toLocalIterator():
        # list.append returns None, so the result must not be reassigned to sims
        sims.append((_id, cosineSimilarity(np.asarray(feature), itemFactor)))
sims = sc.parallelize(sims)
sortedSims = sims.takeOrdered(10, key=lambda x: -x[1])
Thanks in advance for any help.
Recommended answer
You can use the columnSimilarities function of IndexedRowMatrix (in pyspark.mllib.linalg.distributed). It computes cosine similarity. Because it computes similarities between columns, you have to transpose the matrix before applying it.
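from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
# One row per item, then transpose so that each item becomes a column
pred_ = IndexedRowMatrix(Pred_Factors.rdd.map(lambda x: IndexedRow(x[0], x[1]))).toBlockMatrix().transpose().toIndexedRowMatrix()
pred_sims = pred_.columnSimilarities()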
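To get from pairwise similarities to the top 10 per item that the question asks for, something like the following might work. It is a minimal local sketch in plain Python, not part of the original answer. It assumes the similarities are available as (i, j, similarity) triples in which each unordered pair appears only once (i < j), which is how the upper-triangular output of columnSimilarities is understood here. The helper name top_k_per_item and the toy data are hypothetical.

```python
import heapq
from collections import defaultdict

def top_k_per_item(entries, k=10):
    """Return {item: [(other_item, similarity), ...]} with the k most similar items.

    `entries` is an iterable of (i, j, similarity) triples that lists each
    unordered pair once (assumption: i < j, upper-triangular layout).
    """
    neighbours = defaultdict(list)
    for i, j, sim in entries:
        # Mirror each pair so that both items see each other as a neighbour.
        neighbours[i].append((j, sim))
        neighbours[j].append((i, sim))
    return {
        item: heapq.nlargest(k, others, key=lambda pair: pair[1])
        for item, others in neighbours.items()
    }

# Toy data (hypothetical): three items, each pair listed once.
toy_entries = [(0, 1, 0.9), (0, 2, 0.1), (1, 2, 0.5)]
top = top_k_per_item(toy_entries, k=1)
```

On Spark, the triples could presumably be taken from pred_sims.entries, for example with pred_sims.entries.map(lambda e: (e.i, e.j, e.value)). Collecting them to the driver is only reasonable for a small item catalogue. For a large one, the same mirroring and per-key top-k would have to be done on the RDD itself.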