Get the document name in scikit-learn tf-idf matrix


Question

I have created a tf-idf matrix, but now I want to retrieve the top 2 words for each document. I want to pass a document id and get back the top 2 words for that document.

Right now, I have this sample data:

from sklearn.feature_extraction.text import TfidfVectorizer

d = {'doc1':"this is the first document",'doc2':"it is a sunny day"} ### corpus

test_v = TfidfVectorizer(min_df=1)    ### applied the model
t = test_v.fit_transform(d.values())
feature_names = test_v.get_feature_names_out().tolist() ### list of words/terms (get_feature_names() was removed in scikit-learn 1.2)

>>> feature_names
['day', 'document', 'first', 'is', 'it', 'sunny', 'the', 'this']

>>> t.toarray()
array([[ 0.        ,  0.47107781,  0.47107781,  0.33517574,  0.        ,
     0.        ,  0.47107781,  0.47107781],
   [ 0.53404633,  0.        ,  0.        ,  0.37997836,  0.53404633,
     0.53404633,  0.        ,  0.        ]])

I can access the matrix by giving the row and column index, e.g.:

 >>> t[0,1]
   0.47107781233161794

Is there a way to access this matrix by document id? In my case, 'doc1' and 'doc2'.

Thanks

Recommended answer

By doing

t = test_v.fit_transform(d.values())

you lose any link to the document ids. Before Python 3.7 a dict was not guaranteed to be ordered, so you cannot rely on the order in which the values are given. This means that before passing the values to the fit_transform function you need to record which value corresponds to which id.

For example, what you can do is:

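counter = 0
values = []   # documents, in the order they are passed to fit_transform
key = {}      # maps each document id to its row index in the matrix

for k,v in d.items():
    values.append(v)
    key[k] = counter
    counter+=1

t = test_v.fit_transform(values)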

From there you can build a function to access this matrix by document id:

def get_doc_row(docid):
    rowid = key[docid]
    row = t[rowid,:]
    return row
