Generating a dense matrix from a sparse matrix in numpy python

Question

I have a SQLite database with the following schema:

termcount(doc_num, term, count)

This table stores each term together with its count in a given document, for example:

(doc1, term1, 12)
(doc1, term22, 2)
.
.
(docn, term1, 10)

This can be viewed as a sparse matrix, since each document contains only a few terms with a non-zero count.

How can I create a dense matrix from this sparse representation using numpy? I need it to calculate the similarity between documents using cosine similarity.

The dense matrix should look like a table with the doc ids in the first column and all the terms listed in the first row. The remaining cells contain the counts.

Recommended answer

I solved this problem using Pandas, because we want to keep the document ids and term ids.

from pandas import DataFrame

# A sparse matrix in dictionary form (could be loaded from a SQLite database).
# Each key is a (doc_id, term_id) tuple; each value is the count.
doc_term_dict = {('d1', 't1'): 12, ('d2', 't3'): 10, ('d3', 't2'): 5}

# Extract all unique document and term ids and initialize a zero-filled dataframe.
# The ids are sorted because pandas needs an ordered index, not a set.
rows = sorted({d for (d, t) in doc_term_dict})
cols = sorted({t for (d, t) in doc_term_dict})
df = DataFrame(0, index=rows, columns=cols)

# Assign all nonzero values in the dataframe.
# .loc[row, column] avoids chained indexing, which may write to a copy.
for (doc_id, term_id), value in doc_term_dict.items():
    df.loc[doc_id, term_id] = value

print(df)

Output:

    t1  t2  t3
d1  12   0   0
d2   0   0  10
d3   0   5   0
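
Since the stated goal is cosine similarity between documents, the counts can also be placed directly in a NumPy array. The following is a minimal sketch and not part of the original answer. The `triples` list stands in for rows read from the `termcount` table, the extra `('d3', 't1', 3)` row is made up so that two documents share a term, and the helper name `cosine_similarity_matrix` is an illustrative choice.

```python
import numpy as np

# Hypothetical (doc_id, term_id, count) rows, shaped like the result of
# "SELECT doc_num, term, count FROM termcount". The last row is invented
# so that d1 and d3 share a term.
triples = [("d1", "t1", 12), ("d2", "t3", 10), ("d3", "t2", 5), ("d3", "t1", 3)]

# Map each document id and term id to a row / column position.
docs = sorted({d for d, _, _ in triples})
terms = sorted({t for _, t, _ in triples})
doc_index = {d: i for i, d in enumerate(docs)}
term_index = {t: j for j, t in enumerate(terms)}

# Fill a zero matrix with the non-zero counts.
dense = np.zeros((len(docs), len(terms)), dtype=float)
for d, t, c in triples:
    dense[doc_index[d], term_index[t]] = c


def cosine_similarity_matrix(matrix):
    """Return the pairwise cosine similarity between the rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zeros instead of dividing by 0
    unit = matrix / norms
    return unit @ unit.T


similarity = cosine_similarity_matrix(dense)
print(docs)
print(np.round(similarity, 3))
```

Entry `similarity[i, j]` compares `docs[i]` with `docs[j]`. For a large vocabulary a dense array of this shape may not fit in memory, in which case a sparse representation would be the safer choice.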
