Converting a text corpus to a text document with vocabulary_id and respective tfidf score
Question
I have a text corpus with, say, 5 documents, each separated from the next by a newline (\n). I want to assign an id to every word in the documents and calculate its respective tf-idf score.
For example, suppose we have a text corpus named "corpus.txt" as follows:
"Stack over flow text vectorization scikit python scipy sparse csr"
I calculate the tf-idf using:
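from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# Read the corpus file, one document per line (list("corpus.txt") would only split the file name into characters)
with open("corpus.txt") as f:
    mylist = [line.strip() for line in f if line.strip()]
vectorizer = CountVectorizer()
x_counts = vectorizer.fit_transform(mylist)
tfidf_transformer = TfidfTransformer()
x_tfidf = tfidf_transformer.fit_transform(x_counts)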
The output is:
(0,12) 0.1234 #for 1st document
(1,8) 0.3456 #for 2nd document
(1,4) 0.8976
(2,15) 0.6754 #for third document
(2,14) 0.2389
(2,3) 0.7823
(3,11) 0.9897 #for fourth document
(3,13) 0.8213
(3,5) 0.7722
(3,6) 0.2211
(4,7) 0.1100 # for fifth document
(4,10) 0.6690
(4,2) 0.0912
(4,9) 0.2345
(4,1) 0.1234
I converted this scipy.sparse.csr matrix into a list of lists to remove the document id, keeping only the vocabulary_id and its respective tf-idf score, using:
m = x_tfidf.tocoo()
mydata = {k: v for k, v in zip(m.col, m.data)}
key_val_pairs = [str(k) + ":" + str(v) for k, v in mydata.items()]
The problem is that I get an output where the vocabulary_id and its respective tf-idf score are arranged in ascending order of id, without any reference to the document they belong to.
For example, for the corpus above, my current output (dumped into a text file using json) looks like:
1:0.1234
2:0.0912
3:0.7823
4:0.8976
5:0.7722
6:0.2211
7:0.1100
8:0.3456
9:0.2345
10:0.6690
11:0.9897
12:0.1234
13:0.8213
14:0.2389
15:0.6754
whereas I would like my text file to look like this, with one line per document:
12:0.1234
8:0.3456 4:0.8976
15:0.6754 14:0.2389 3:0.7823
11:0.9897 13:0.8213 5:0.7722 6:0.2211
7:0.1100 10:0.6690 2:0.0912 9:0.2345 1:0.1234
Any idea how to get this done?
Recommended answer
I guess this is what you need. Here corpus is a collection of documents.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["stack over flow stack over flow text vectorization scikit", "stack over flow"]
vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(corpus) # corpus is a collection of documents
print(vectorizer.vocabulary_) # vocabulary terms and their index
print(x) # tf-idf weight of each term, keyed by (document index, term index)
This prints:
{'vectorization': 5, 'text': 4, 'over': 1, 'flow': 0, 'stack': 3, 'scikit': 2}
(0, 2) 0.33195438857 # first document, word = scikit
(0, 5) 0.33195438857 # word = vectorization
(0, 4) 0.33195438857 # word = text
(0, 0) 0.472376562969 # word = flow
(0, 1) 0.472376562969 # word = over
(0, 3) 0.472376562969 # word = stack
(1, 0) 0.57735026919 # second document
(1, 1) 0.57735026919
(1, 3) 0.57735026919
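The second number in each pair above is a column index into vectorizer.vocabulary_, which maps term to index. To read the entries back as words, the mapping can be inverted. The following is a minimal sketch under that assumption; the name id_to_term is introduced here for illustration and is not part of the original answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["stack over flow stack over flow text vectorization scikit", "stack over flow"]
vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(corpus)

# vocabulary_ maps term -> column index; invert it to column index -> term.
# id_to_term is a hypothetical helper name used only in this sketch.
id_to_term = {int(idx): term for term, idx in vectorizer.vocabulary_.items()}

# Walk the non-zero entries and show the term next to each column index.
cx = x.tocoo()
for doc, col, weight in zip(cx.row, cx.col, cx.data):
    print(doc, col, id_to_term[int(col)], "{:.4f}".format(weight))
```

Each printed row carries the document index, the vocabulary id, the term itself, and the weight, which makes it easy to check that an id refers to the word you expect.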
From this information, you can represent the documents in your desired way as follows:
cx = x.tocoo()
doc_id = -1
for i, j, v in zip(cx.row, cx.col, cx.data):
    if doc_id == -1:
        print(str(j) + ':' + "{:.4f}".format(v), end=' ')
    else:
        if doc_id != i:
            print()
        print(str(j) + ':' + "{:.4f}".format(v), end=' ')
    doc_id = i
This prints:
2:0.3320 5:0.3320 4:0.3320 0:0.4724 1:0.4724 3:0.4724
0:0.5774 1:0.5774 3:0.5774
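Since the question asks for a text file with one line per document, an alternative is to slice each row of the CSR matrix directly through its indptr, indices and data arrays and write the lines out. This is only a sketch, not the answerer's method: the helper rows_as_lines and the output path "tfidf_by_doc.txt" are names made up for illustration, and the order of ids within a line follows whatever order the matrix stores them in.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["stack over flow stack over flow text vectorization scikit", "stack over flow"]
x = TfidfVectorizer().fit_transform(corpus).tocsr()


def rows_as_lines(matrix):
    """Return one 'id:weight id:weight ...' string per row of a CSR matrix."""
    lines = []
    for r in range(matrix.shape[0]):
        # indptr[r]:indptr[r + 1] delimits the stored entries of row r.
        start, end = matrix.indptr[r], matrix.indptr[r + 1]
        pairs = zip(matrix.indices[start:end], matrix.data[start:end])
        lines.append(" ".join("{}:{:.4f}".format(int(c), v) for c, v in pairs))
    return lines


lines = rows_as_lines(x)

# "tfidf_by_doc.txt" is a placeholder output path chosen for this sketch.
with open("tfidf_by_doc.txt", "w") as f:
    f.write("\n".join(lines) + "\n")
```

A document with no known terms produces an empty line here, so line numbers in the file stay aligned with document positions in the corpus.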