Calculate TF-IDF using sklearn for n-grams in Python

Question

I have a vocabulary list that includes n-grams, as follows.

myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding']

I want to calculate TF-IDF values for these terms.

I also have a corpus stored as a dictionary (key = recipe number, value = recipe text).

corpus = {1: "making chocolates biscuit pudding easy first get your favourite biscuit chocolates", 2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates", 3: "making chocolates drink different way using fresh milk egg"}

I am currently using the following code.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())

I then print the tokens or n-grams of recipe 1 in the corpus along with their TF-IDF values, as follows.

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = tfidf.get_feature_names_out()
doc = 0  # row index of recipe 1
feature_index = tfs[doc, :].nonzero()[1]  # columns with a non-zero score
for i in feature_index:
    print(feature_names[i], tfs[doc, i])

The result I get is chocolates 1.0. However, my code does not detect n-grams (bigrams) such as biscuit pudding when calculating the TF-IDF values. Please let me know where my code goes wrong.

I want to get the TF-IDF matrix for the myvocabulary terms, computed over the recipe documents in the corpus. In other words, the rows of the matrix represent the myvocabulary terms and the columns represent the recipe documents of my corpus. Please help me.

Recommended answer

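Try widening the ngram_range in TfidfVectorizer so that bigrams are generated as well (the default, (1, 1), produces single words only), then fit again:

tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english', ngram_range=(1,2))
tfs = tfidf.fit_transform(corpus.values())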
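To see why the wider range matters, here is a minimal sketch that inspects the tokens the vectorizer generates before it matches them against the vocabulary. It uses build_analyzer(), which returns the preprocessing and tokenizing callable; the sample text is a shortened version of recipe 1, chosen only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Shortened version of recipe 1 (illustrative input only).
text = "making chocolates biscuit pudding easy"

# Default ngram_range=(1, 1): the analyzer emits single words only.
unigram_tokens = TfidfVectorizer(stop_words="english").build_analyzer()(text)

# ngram_range=(1, 2): the analyzer emits single words and adjacent word pairs.
bigram_tokens = TfidfVectorizer(
    stop_words="english", ngram_range=(1, 2)
).build_analyzer()(text)

print("biscuit pudding" in unigram_tokens)  # False
print("biscuit pudding" in bigram_tokens)   # True
```

A vocabulary entry can only be counted if the analyzer actually produces it as a token, which is why the two-word entries score zero with the default range.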

The output of TfidfVectorizer is the TF-IDF matrix in sparse format, with documents as rows and vocabulary terms as columns (that is, the transpose of the layout you are after). You can print its contents like this, for example:

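# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = tfidf.get_feature_names_out()
corpus_index = list(corpus)  # recipe numbers, in the same order as the matrix rows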
rows, cols = tfs.nonzero()
for row, col in zip(rows, cols):
    print((feature_names[col], corpus_index[row]), tfs[row, col])

This should produce:

('biscuit pudding', 1) 0.646128915046
('chocolates', 1) 0.763228291628
('chocolates', 2) 0.508542320378
('tim tam', 2) 0.861036995944
('chocolates', 3) 0.508542320378
('fresh milk', 3) 0.861036995944
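
These scores can be checked by hand. The following is a minimal sketch, assuming scikit-learn's default settings (smooth_idf=True, norm='l2', sublinear_tf=False), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each document row is scaled to unit Euclidean length. The helper names idf and tfidf_row and the hand-counted frequencies are illustrative, not part of scikit-learn.

```python
import math

n_docs = 3  # number of recipes in the corpus

# Number of recipes containing each vocabulary term (counted by hand).
doc_freq = {"tim tam": 1, "jam": 0, "fresh milk": 1,
            "chocolates": 3, "biscuit pudding": 1}

def idf(term):
    # Smoothed IDF, assuming scikit-learn's default smooth_idf=True.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(term_counts):
    # Hypothetical helper: raw count times IDF, then L2-normalise the row.
    raw = {t: c * idf(t) for t, c in term_counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

# Recipe 1 contains "chocolates" twice and "biscuit pudding" once.
recipe_1 = tfidf_row({"chocolates": 2, "biscuit pudding": 1})
# Recipe 2 contains "chocolates" once and "tim tam" once.
recipe_2 = tfidf_row({"chocolates": 1, "tim tam": 1})

print(recipe_1)
print(recipe_2)
```

Rounded to six decimals, the values agree with the listing above: chocolates appears in every recipe, so its IDF is 1, while terms found in a single recipe get an IDF of ln(2) + 1.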

If the matrix is not large, it may be easier to examine it in dense form. Pandas makes this very convenient:

import pandas as pd
# Transpose so that rows are vocabulary terms and columns are recipes
df = pd.DataFrame(tfs.T.toarray(), index=feature_names, columns=corpus_index)
print(df)

This results in:

                        1         2         3
tim tam          0.000000  0.861037  0.000000
jam              0.000000  0.000000  0.000000
fresh milk       0.000000  0.000000  0.861037
chocolates       0.763228  0.508542  0.508542
biscuit pudding  0.646129  0.000000  0.000000
