Using a csr_matrix of item similarities to get the items most similar to item X without converting the csr_matrix to a dense matrix

Views: 167 Published: 2020/8/6 2:45:13 Tags: python scipy sparse-matrix cosine-similarity


Question

I have purchase data (df_temp). I switched from a Pandas DataFrame to a sparse csr_matrix because I have a lot of products (89,000): I need the user-item information for each product (purchased or not purchased) and then have to calculate the similarities between products.

First, I converted the Pandas DataFrame to a NumPy array:

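 import numpy as np
 from scipy import sparse

 # Keep only the two columns needed for the user-item matrix.
 df_user_product = df_temp[['user_id','product_id']].copy()
 # Structured array with one field per column.
 ar1 = np.array(df_user_product.to_records(index=False))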

Second, I created a coo_matrix, because that format is known for fast sparse matrix construction:

 rows, r_pos = np.unique(ar1['product_id'], return_inverse=True)
 cols, c_pos = np.unique(ar1['user_id'], return_inverse=True)
 s = sparse.coo_matrix((np.ones(r_pos.shape,int), (r_pos, c_pos)))

Third, csr_matrix or csc_matrix is better suited to matrix calculations. I used csr_matrix because the product_ids are in the rows, and row slicing is more efficient in csr_matrix than in csc_matrix:

    sparse_csr_mat = s.tocsr()
    sparse_csr_mat[sparse_csr_mat > 1] = 1
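The three construction steps can be traced on a tiny example. This is only a sketch: the user and product ids below are made up, and the plain arrays stand in for the `ar1['user_id']` and `ar1['product_id']` fields.

```python
import numpy as np
from scipy import sparse

# Hypothetical purchases as parallel arrays; user 10 bought product 500 twice.
user_ids = np.array([10, 10, 11, 12, 12, 10])
product_ids = np.array([500, 300, 300, 500, 400, 500])

# np.unique returns the sorted unique ids and, for every input element,
# the position of its id in that sorted array.
rows, r_pos = np.unique(product_ids, return_inverse=True)
cols, c_pos = np.unique(user_ids, return_inverse=True)

s = sparse.coo_matrix(
    (np.ones(r_pos.shape, int), (r_pos, c_pos)),
    shape=(len(rows), len(cols)),
)

# Duplicate (row, col) pairs are summed when converting to CSR,
# so the repeated purchase is stored as 2.
counts = s.tocsr()

# Every stored value is at least 1, so overwriting the stored data with 1
# gives the same purchased / not purchased matrix as clipping values > 1.
binary = counts.copy()
binary.data[:] = 1
```

Writing to `.data` touches only the stored entries, so it never changes the sparsity structure.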

Then I calculated the cosine similarity between products, which gives:

<89447x89447 sparse matrix of type '<type 'numpy.float64'>' with 1332945 stored elements in Compressed Sparse Row format>
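The call that produced this matrix is not shown. As a rough sketch of one way to get a sparse product-by-product cosine similarity with SciPy alone, each row can be scaled to unit length and the result multiplied by its own transpose. The function name `sparse_cosine_similarity` is hypothetical and not necessarily what was used here.

```python
import numpy as np
from scipy import sparse

def sparse_cosine_similarity(mat):
    """Cosine similarity between the rows of a sparse matrix, returned as CSR."""
    mat = sparse.csr_matrix(mat, dtype=np.float64)
    # L2 norm of each row; rows without entries keep a scale factor of 0.
    norms = np.sqrt(np.asarray(mat.multiply(mat).sum(axis=1)).ravel())
    scale = np.zeros_like(norms)
    nonzero = norms > 0
    scale[nonzero] = 1.0 / norms[nonzero]
    # Left-multiplying by a diagonal matrix rescales the rows without densifying.
    normalized = sparse.diags(scale) @ mat
    return (normalized @ normalized.T).tocsr()

# Toy product-by-user matrix (3 products, 3 users).
toy = sparse.csr_matrix(np.array([[1, 1, 0], [0, 0, 1], [1, 0, 1]]))
similarities = sparse_cosine_similarity(toy)
```

Only pairs of products that share at least one buyer get a stored entry, which is why the result stays sparse.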

Now I want to end up with a dictionary that holds, for each product, its 5 most similar products. How can I do that? I don't want to convert the sparse matrix to a dense array because of memory constraints. I also don't know whether a csr_matrix can be accessed the way an array can, for example by looking up index = product_id and getting the whole row for that index. That would give me all the products similar to product_id, which I could then sort by cosine similarity to get the 5 most similar.

For example, an entry in the similarities matrix:

(product_id1, product_id2) 0.45

How can I keep only the X (5 in my case) products most similar to product_id1 without converting the matrix to an array?

Looking on Stack Overflow, I think lil_matrix could be used for this case. If so, how?

Thanks for your help!

Recommended answer

I finally worked out how to get the 5 most similar items for each product: convert the matrix with .tolil(), turn each row into a NumPy array, and use argsort to pick the 5 most similar items. I used the solution suggested by @hpaulj in this link.

    def max_n(row_data, row_indices, n):
        i = row_data.argsort()[-n:]
        # i = row_data.argpartition(-n)[-n:]
        top_values = row_data[i]
        top_indices = row_indices[i]  # do the sparse indices matter?
        return top_values, top_indices, i

Then I applied it to one row as a test:

top_v, top_ind, ind = max_n(np.array(arr_ll.data[0]),np.array(arr_ll.rows[0]),5)
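The same per-row selection can probably be done without converting to LIL, by slicing the CSR arrays `indptr`, `indices` and `data` directly. The sketch below is an alternative to the `.tolil()` approach, not the method used above. The helper name `top_n_per_row` and its `skip_self` flag are hypothetical.

```python
import numpy as np
from scipy import sparse

def top_n_per_row(sim_csr, n=5, skip_self=True):
    """Map each row index to the column indices of its n largest stored values."""
    sim_csr = sparse.csr_matrix(sim_csr)
    result = {}
    for row in range(sim_csr.shape[0]):
        # indptr marks where each row's entries start and end.
        start, end = sim_csr.indptr[row], sim_csr.indptr[row + 1]
        cols = sim_csr.indices[start:end]
        vals = sim_csr.data[start:end]
        if skip_self:
            keep = cols != row  # drop the item's similarity with itself
            cols, vals = cols[keep], vals[keep]
        order = np.argsort(vals)[::-1][:n]  # largest similarity first
        result[row] = cols[order]
    return result

# Toy symmetric similarity matrix with 1.0 on the diagonal.
toy_sim = sparse.csr_matrix(np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.7],
    [0.5, 0.7, 1.0],
]))
top_by_index = top_n_per_row(toy_sim, n=2)
```

Only one row's stored entries are held as dense arrays at a time, so memory use depends on the densest row rather than on the full 89447 x 89447 shape. Rows with fewer than `n` stored neighbours simply return fewer indices.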

What I need is top_indices, the indices of the 5 most similar products, but those indices are not the real product_ids. I mapped them when I constructed the coo_matrix:

rows, r_pos = np.unique(ar1['product_id'], return_inverse=True)

But how do I get the real product_id back from the indices?

For example, I now have:

top_ind = [2 1 34 9 123]

How do I know which product_id 2 corresponds to, which one 1 corresponds to, and so on?
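One likely answer follows from how `np.unique` works. With `return_inverse=True`, the first returned array (`rows` here) holds the sorted unique product_ids, and `r_pos` holds each record's position in that array. So `rows[k]` is the product_id behind matrix index `k`, and indexing `rows` with `top_ind` should translate the indices back. A minimal sketch with made-up ids, where `top_by_index` is a hypothetical dictionary of per-row results:

```python
import numpy as np

# Hypothetical product ids standing in for ar1['product_id'].
product_ids = np.array([500, 300, 300, 500, 400, 500])
rows, r_pos = np.unique(product_ids, return_inverse=True)

# rows[r_pos] reproduces the original column, which confirms the mapping.
roundtrip = rows[r_pos]

# Translate matrix indices back to real product ids.
top_ind = np.array([2, 0])
top_product_ids = rows[top_ind]

# Build the final dictionary keyed by real product_id.
top_by_index = {0: np.array([2]), 1: np.array([2]), 2: np.array([1, 0])}
similar = {int(rows[i]): rows[idx].tolist() for i, idx in top_by_index.items()}
```

This only works if `rows` is the same array that was used to build the matrix, so it needs to be kept alongside the similarities.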
