How to get item id from cosine similarity matrix?

Question

I am using Spark Scala to calculate the cosine similarity between the rows of a DataFrame.

The DataFrame schema is as follows:

root
    |-- itemId: string (nullable = true)
    |-- features: vector (nullable = true)

A sample of the DataFrame:

    +-------+--------------------+
    | itemId|            features|
    +-------+--------------------+
    | ab    |[4.7143,0.0,5.785...|
    | cd    |[5.5,0.0,6.4286,4...|
    | ef    |[4.7143,1.4286,6....|
    ........
    +-------+--------------------+

Code to compute the cosine similarities:

// IndexedRow takes the row index first, then the vector.
val irm = new IndexedRowMatrix(myDataframe.rdd.zipWithIndex().map {
  case (row, index) => IndexedRow(index, row.getAs[Vector]("features"))
}).toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities

In the irm matrix, I have entries (i, j, score), where i and j are row indexes of my original DataFrame. What I would like to get is (itemIdA, itemIdB, score), where itemIdA and itemIdB are the ids of the rows at index i and j respectively. Should I join irm with the initial DataFrame, or is there a better option?

Recommended answer

Create a row index before converting the DataFrame to a matrix, and build a mapping between the index and the id. After the computation, use that Map to convert each column index (originally a row index, but turned into a column index by the transpose) back to its id.

val rdd = myDataframe.as[(String, org.apache.spark.mllib.linalg.Vector)].rdd.zipWithIndex()
val indexMap = rdd.map{case ((id, vec), index) => (index, id)}.collectAsMap()

Calculate the cosine similarities as before:

val irm = new IndexedRowMatrix(rdd.map{case ((id, vec), index) => IndexedRow(index, vec)})
  .toCoordinateMatrix().transpose().toRowMatrix().columnSimilarities()

Convert the column indexes back to ids:

irm.entries.map(e => (indexMap(e.i), indexMap(e.j), e.value)) 
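
To make the index-to-id mapping easier to follow, here is a minimal sketch of the same steps on plain Scala collections, without Spark. The object name `IndexToIdSketch`, the helper names, and the sample vectors mentioned afterwards are assumptions made for illustration and are not part of the original answer. Unlike `columnSimilarities`, which returns a sparse matrix, this sketch emits every pair with i < j, including pairs whose score is zero.

```scala
object IndexToIdSketch {
  // Cosine similarity between two dense vectors of equal length.
  // Returns 0.0 if either vector has zero norm.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Returns (itemIdA, itemIdB, score) for every pair of rows with i < j,
  // mirroring the upper-triangular layout of columnSimilarities.
  def similaritiesById(items: Seq[(String, Array[Double])]): Seq[(String, String, Double)] = {
    // Step 1: attach a row index to each item: ((id, vec), index).
    val indexed = items.zipWithIndex

    // Step 2: build the index -> id lookup, like indexMap in the answer.
    val indexMap: Map[Int, String] =
      indexed.map { case ((id, _), index) => (index, id) }.toMap

    // Step 3: compute (i, j, score) entries using indexes only.
    val entries = for {
      ((_, vecA), i) <- indexed
      ((_, vecB), j) <- indexed
      if i < j
    } yield (i, j, cosine(vecA, vecB))

    // Step 4: translate both indexes back to item ids.
    entries.map { case (i, j, score) => (indexMap(i), indexMap(j), score) }
  }
}
```

For example, calling `IndexToIdSketch.similaritiesById` with the hypothetical rows `("ab", Array(1.0, 0.0))`, `("cd", Array(0.0, 1.0))` and `("ef", Array(1.0, 0.0))` yields the pairs (ab, cd), (ab, ef) and (cd, ef), where only (ab, ef) has a score of 1.0.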

This should give you what you are looking for.
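
The question also asks about joining the result with the original data. If the index-to-id map is too large to collect on the driver, a join on the RDDs may be an option. The sketch below is an assumption-laden illustration rather than part of the original answer: `JoinIdsSketch` and `joinIds` are hypothetical names, `entries` is expected to be something like `irm.entries`, and `idByIndex` is expected to be the `(index, id)` pairs from the zipped RDD before `collectAsMap()` is called.

```scala
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
import org.apache.spark.rdd.RDD

object JoinIdsSketch {
  // entries:   (i, j, score) entries of the similarity matrix
  // idByIndex: (rowIndex, itemId) pairs built with zipWithIndex
  def joinIds(
      entries: RDD[MatrixEntry],
      idByIndex: RDD[(Long, String)]
  ): RDD[(String, String, Double)] = {
    entries
      .map(e => (e.i, (e.j, e.value)))                            // key by i
      .join(idByIndex)                                            // (i, ((j, score), idA))
      .map { case (_, ((j, score), idA)) => (j, (idA, score)) }   // re-key by j
      .join(idByIndex)                                            // (j, ((idA, score), idB))
      .map { case (_, ((idA, score), idB)) => (idA, idB, score) }
  }
}
```

This trades the driver-side map for two shuffles, so the `collectAsMap()` approach is likely simpler and faster whenever the id map fits comfortably in memory.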
