Calculate TF-IDF using sklearn for n-grams in Python

Question

I have a vocabulary list that includes n-grams, as follows.

myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding']

I want to use these terms to calculate TF-IDF values.

I also have a corpus stored as a dictionary, as follows (key = recipe number, value = recipe).

corpus = {1: "making chocolates biscuit pudding easy first get your favourite biscuit chocolates", 2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates", 3: "making chocolates drink different way using fresh milk egg"}

I am currently using the following code.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())

Now I print the tokens or n-grams of recipe 1 in the corpus along with their TF-IDF values, as follows.

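# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
feature_names = tfidf.get_feature_names_out()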
doc = 0
feature_index = tfs[doc,:].nonzero()[1]
tfidf_scores = zip(feature_index, [tfs[doc, x] for x in feature_index])
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
  print(w, s)

The result I get is chocolates 1.0. However, my code does not detect n-grams (bigrams) such as biscuit pudding when calculating TF-IDF values. Please let me know what is wrong with my code.

I want to get the TF-IDF matrix for the myvocabulary terms, computed over the recipe documents in the corpus. In other words, the rows of the matrix represent the myvocabulary terms and the columns represent the recipe documents of my corpus. Please help me.

Recommended answer

Try increasing the ngram_range in TfidfVectorizer. The default is (1, 1), so only unigrams are extracted and the bigrams in your vocabulary are never matched:

tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english', ngram_range=(1,2))
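
To see why the default setting misses the bigrams, here is a minimal pure-Python sketch of word n-gram generation. It is a simplified stand-in for what the vectorizer's analyzer does, not sklearn's actual code: the helper name word_ngrams is hypothetical, and the sketch assumes plain whitespace tokenization with no stop-word removal.

```python
def word_ngrams(text, ngram_range=(1, 1)):
    """Return all word n-grams of text for every n in ngram_range (inclusive)."""
    tokens = text.lower().split()  # simplification: whitespace tokenization only
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


recipe = "making chocolates biscuit pudding easy"
unigrams_only = word_ngrams(recipe, (1, 1))
with_bigrams = word_ngrams(recipe, (1, 2))

# With (1, 1) the term 'biscuit pudding' is never produced, so a
# vocabulary entry for it can never be counted.
print("biscuit pudding" in unigrams_only)
print("biscuit pudding" in with_bigrams)
```

With a range of (1, 1) the only candidate terms are single words, which is why only chocolates received a score in the question.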

The output of TfidfVectorizer is the TF-IDF matrix in sparse format (or, more precisely, the transpose of the layout you are after: its rows are documents and its columns are vocabulary terms). You can print out its contents, e.g. like this:

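# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
feature_names = tfidf.get_feature_names_out()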
corpus_index = [n for n in corpus]
rows, cols = tfs.nonzero()
for row, col in zip(rows, cols):
    print((feature_names[col], corpus_index[row]), tfs[row, col])

which should yield

('biscuit pudding', 1) 0.646128915046
('chocolates', 1) 0.763228291628
('chocolates', 2) 0.508542320378
('tim tam', 2) 0.861036995944
('chocolates', 3) 0.508542320378
('fresh milk', 3) 0.861036995944

If the matrix is not large, it might be easier to examine it in dense form. Pandas makes this very convenient:

import pandas as pd
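# Transpose so that rows are vocabulary terms and columns are recipes; toarray() returns a plain ndarray
df = pd.DataFrame(tfs.T.toarray(), index=feature_names, columns=corpus_index)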
print(df)

The result is

                        1         2         3
tim tam          0.000000  0.861037  0.000000
jam              0.000000  0.000000  0.000000
fresh milk       0.000000  0.000000  0.861037
chocolates       0.763228  0.508542  0.508542
biscuit pudding  0.646129  0.000000  0.000000
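
These values can be checked by hand. The sketch below assumes TfidfVectorizer's default settings (smooth_idf=True, norm='l2', sublinear_tf=False), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each document's vector is then scaled to unit Euclidean length. The term counts are read off the corpus in the question, and the variable names are illustrative only.

```python
import math

n_docs = 3

# Raw counts of the vocabulary terms that occur in each recipe
# (counted by hand from the corpus in the question).
counts = {
    1: {"chocolates": 2, "biscuit pudding": 1},
    2: {"chocolates": 1, "tim tam": 1},
    3: {"chocolates": 1, "fresh milk": 1},
}

# Document frequency: the number of recipes that contain each term.
doc_freq = {}
for doc_counts in counts.values():
    for term in doc_counts:
        doc_freq[term] = doc_freq.get(term, 0) + 1

# Smoothed inverse document frequency (assumed sklearn default).
idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}

tfidf = {}
for doc, doc_counts in counts.items():
    raw = {t: c * idf[t] for t, c in doc_counts.items()}
    # L2-normalize each document's vector.
    norm = math.sqrt(sum(v * v for v in raw.values()))
    tfidf[doc] = {t: v / norm for t, v in raw.items()}

for doc, scores in tfidf.items():
    for term, score in scores.items():
        print(doc, term, round(score, 6))
```

Because chocolates appears in all three recipes, its idf is exactly 1, so its score in each recipe is driven only by its count and by the normalization.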
