Spark cosine distance between rows using Dataframe

Problem description

I have to compute the cosine distance between every pair of rows, but I have no idea how to do it elegantly with the Spark DataFrame API. The idea is to compute the similarity of each row (item) to every other row and keep the top 10 most similar rows. This is needed for an item-item recommender system.

Everything I have read about this deals with computing similarity over columns (see "Apache Spark Python Cosine Similarity over DataFrames"). Can someone say whether it is possible to compute the cosine distance between rows elegantly using the PySpark DataFrame API or RDDs, or do I have to do it manually?

Here is some code to show what I intend to do:

from numpy import asarray
from numpy import linalg as LA

def cosineSimilarity(vec1, vec2):
    return vec1.dot(vec2) / (LA.norm(vec1) * LA.norm(vec2))


# p.s. model is ALS
Pred_Factors = model.itemFactors.cache()  # DataFrame[id: int, features: array<float>]

sims = []

# Compare every item with every other item on the driver (O(n^2) pairs).
for _id, _feature in Pred_Factors.toLocalIterator():
    itemFactor = asarray(_feature)
    for other_id, feature in Pred_Factors.toLocalIterator():
        # list.append returns None, so append in place instead of reassigning sims
        sims.append((_id, other_id, cosineSimilarity(asarray(feature), itemFactor)))

sims = sc.parallelize(sims)
sortedSims = sims.takeOrdered(10, key=lambda x: -x[2])

Thanks in advance for any help.

Recommended answer

You can use the columnSimilarities function of IndexedRowMatrix (in pyspark.mllib.linalg.distributed). It uses cosine similarity as its measure. It computes similarities between columns, so you have to transpose the matrix before applying it, so that each item becomes a column.

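from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
# Transpose so that each item becomes a column, then compare columns.
pred = IndexedRowMatrix(Pred_Factors.rdd.map(lambda x: IndexedRow(x[0], x[1]))).toBlockMatrix().transpose().toIndexedRowMatrix()
pred_sims = pred.columnSimilarities()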
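
The question also asks for the top 10 similar items per item, which the snippet above does not cover. As far as I know, columnSimilarities returns a CoordinateMatrix whose entries (pred_sims.entries, each with fields i, j and value) contain only the upper triangle, that is, pairs with i < j. Each entry therefore has to be mirrored before ranking neighbours per item. The following is a minimal pure-Python sketch of that post-processing on a tiny made-up example, useful for checking the logic without a cluster. The names pairwise_entries, top_k_per_item and item_factors are hypothetical and not part of any Spark API.

```python
import heapq
import math
from collections import defaultdict


def cosine_similarity(vec1, vec2):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)


def pairwise_entries(factors):
    """Yield (i, j, similarity) for i < j only, i.e. upper-triangular form."""
    ids = sorted(factors)
    for pos, i in enumerate(ids):
        for j in ids[pos + 1:]:
            yield i, j, cosine_similarity(factors[i], factors[j])


def top_k_per_item(entries, k):
    """Mirror each (i, j, sim) entry and keep the k most similar items per id."""
    neighbours = defaultdict(list)
    for i, j, sim in entries:
        neighbours[i].append((sim, j))
        neighbours[j].append((sim, i))
    return {
        item: [(other, sim) for sim, other in heapq.nlargest(k, pairs)]
        for item, pairs in neighbours.items()
    }


# Hypothetical item factors keyed by item id (made-up values for illustration).
item_factors = {
    0: [1.0, 0.0],
    1: [0.0, 1.0],
    2: [1.0, 1.0],
    3: [2.0, 0.0],
}

top = top_k_per_item(pairwise_entries(item_factors), k=2)
for item_id in sorted(top):
    print(item_id, top[item_id])
```

On a real dataset the same mirroring and per-key ranking would be expressed with RDD or DataFrame operations over pred_sims.entries rather than collected to the driver; this sketch only illustrates the shape of the computation.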
