Spark cosine distance between rows using a DataFrame


Question

I have to compute the cosine distance between every pair of rows, but I have no idea how to do it elegantly with the Spark DataFrame API. The idea is to compute the similarity of each row (item) to every other row and keep the 10 most similar rows for each one. This is needed for an item-item recommender system.

Everything I have read so far deals with computing similarity over columns (see "Apache Spark Python Cosine Similarity over DataFrames"). Is it possible to compute the cosine distance between rows elegantly using PySpark's DataFrame API or RDDs, or do I have to do it manually?

Here is some code to show what I intend to do:

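import numpy as np
from numpy import linalg as LA

def cosineSimilarity(vec1, vec2):
    # Cosine of the angle between two NumPy vectors
    return vec1.dot(vec2) / (LA.norm(vec1) * LA.norm(vec2))


# P.S. model is a fitted ALS model and sc is the SparkContext
Pred_Factors = model.itemFactors.cache()  # DataFrame[id: int, features: array<float>]

sims = []

for _id, _feature in Pred_Factors.toLocalIterator():
    itemFactor = np.asarray(_feature)
    for id, feature in Pred_Factors.toLocalIterator():
        if id == _id:
            continue  # do not compare an item with itself
        # list.append returns None, so do not reassign sims here
        sims.append((_id, id, float(cosineSimilarity(np.asarray(feature), itemFactor))))

sims = sc.parallelize(sims)
# Ten most similar pairs overall, ordered by descending similarity
sortedSims = sims.takeOrdered(10, key=lambda x: -x[2])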

Thanks in advance for any help.

Recommended answer

You can use the columnSimilarities function of IndexedRowMatrix (from pyspark.mllib.linalg.distributed). It computes the cosine similarity between columns, so to compare rows you have to transpose the matrix before applying it.

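from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
# Rows are items; after the transpose each column is an item, indexed by its id
pred_ = IndexedRowMatrix(Pred_Factors.rdd.map(lambda x: IndexedRow(x[0], x[1]))).toBlockMatrix().transpose().toIndexedRowMatrix()
pred_sims = pred_.columnSimilarities()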
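
For checking results on a small sample, the row-wise computation the question describes can be sketched locally in plain Python, without Spark. This is only an illustration under stated assumptions: the names cosine_similarity, top_k_similar and the toy factors dictionary are hypothetical and stand in for collected rows of model.itemFactors. It is not a replacement for the distributed columnSimilarities call, and the pairwise loop is quadratic in the number of items.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length sequences of floats.

    Raises ZeroDivisionError if either vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similar(item_factors, k=10):
    """Map each item id to its k most similar other items.

    item_factors is a dict of id -> feature list (a hypothetical
    stand-in for the collected rows of model.itemFactors).
    """
    result = {}
    for i, fi in item_factors.items():
        # Score every other item against item i, skipping i itself.
        scored = ((cosine_similarity(fi, fj), j)
                  for j, fj in item_factors.items() if j != i)
        # nlargest returns the k best (score, id) pairs in descending order.
        result[i] = [(j, s) for s, j in heapq.nlargest(k, scored)]
    return result


# Toy factors: items 1 and 2 point in nearly the same direction.
factors = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0]}
neighbours = top_k_similar(factors, k=2)
```

Each value in neighbours is a list of (item id, similarity) pairs sorted from most to least similar, which is the per-item top-k shape an item-item recommender needs.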
