Understanding TfidfVectorizer output


Question

I'm testing TfidfVectorizer with a simple example, and I can't figure out the results.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)

print(vect.get_feature_names())    
print(tfidf.shape)
print(tfidf)

Output:

['apple', 'away', 'blue', 'compare', 'day', 'docs', 'doctor', 'keeps', 'learn', 'like', 'orange', 'prefer', 'scikit']
(5, 13)
  (0, 0)    0.5564505207186616
  (0, 9)    0.830880748357988
  ...

I'm calculating the tfidf of the first sentence and I'm getting different results:

  • The first document ("I'd like an apple") contains just 2 words after removing stop words; according to the print of vect.get_feature_names(), we are left with: "like", "apple"
  • TF("apple", Document_1) = 1/2 = 0.5
  • TF("like", Document_1) = 1/2 = 0.5
  • The word apple appears in 3 of the 5 documents.
  • The word like appears in 1 document.
  • IDF("apple") = ln(5/3) = 0.51082
  • IDF("like") = ln(5/1) = 1.60943

So:

  • tfidf("apple") in document1 = 0.5 * 0.51082 = 0.255 != 0.5564
  • tfidf("like") in document1 = 0.5 * 1.60943 = 0.804 != 0.8308

What am I missing?

Answer

There are several issues with your calculation.

First, there are multiple conventions on how to calculate TF (see the Wikipedia entry); scikit-learn does not normalize it with the document length. From the user guide:

[...] the term frequency, the number of times a term occurs in a given document [...]

So, here, TF("apple", Document_1) = 1, and not 0.5
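A quick way to see this is to compare with the raw counts from CountVectorizer under the same settings (a minimal sketch; count_vect is my own variable name, not from the question):

from sklearn.feature_extraction.text import CountVectorizer

# Same preprocessing as the TfidfVectorizer above; the TF that
# scikit-learn uses is this raw count, not count / document length.
count_vect = CountVectorizer(min_df=1, stop_words="english")
counts = count_vect.fit_transform(corpus)

# Row 0 is "I'd like an apple": both 'apple' (index 0) and
# 'like' (index 9) have a raw count of 1
print(counts.toarray()[0])    # [1 0 0 0 0 0 0 0 0 1 0 0 0]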

Second, regarding the IDF definition - from the docs:

If smooth_idf=True (the default), the constant "1" is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.

So, here we will have

IDF ("apple") = ln(5+1/3+1) + 1 = 1.4054651081081644

Hence

TF-IDF("apple") = 1 * 1.4054651081081644 =  1.4054651081081644

Third, with the default setting norm='l2', there is an extra normalization taking place; from the docs again:

Normalization is "c" (cosine) when norm='l2', "n" (none) when norm=None.

Explicitly removing this extra normalization from your example, i.e.

vect = TfidfVectorizer(min_df=1, stop_words="english", norm=None)
tfidf = vect.fit_transform(corpus)
print(tfidf)

gives for 'apple':

  (0, 0)    1.4054651081081644

i.e. as manually calculated above.
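To connect this back to the 0.5564/0.8308 values in the original output, one can apply the l2 normalization by hand: divide the unnormalized document vector by its Euclidean norm. A sketch (the IDF for 'like' uses the same smoothed formula, with df = 1):

import numpy as np

tfidf_apple = 1 * (np.log((1 + 5) / (1 + 3)) + 1)   # tf=1, idf≈1.40547
tfidf_like  = 1 * (np.log((1 + 5) / (1 + 1)) + 1)   # tf=1, idf≈2.09861

vec = np.array([tfidf_apple, tfidf_like])
vec_l2 = vec / np.linalg.norm(vec)    # l2 (cosine) normalization

print(vec_l2)    # [0.55645052 0.83088075] -- matches the first output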

For the details of how exactly the normalization affects the calculations when norm='l2' (the default setting), see the Tf–idf term weighting section of the user guide; by their own admission:

the tf-idfs computed in scikit-learn’s TfidfTransformer and TfidfVectorizer differ slightly from the standard textbook notation
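As a final sanity check (assuming the fitted vect from above), the learned IDF values are exposed on the idf_ attribute and follow the smoothed formula:

# idf_ is aligned with get_feature_names()
# (get_feature_names_out() in newer scikit-learn versions)
print(dict(zip(vect.get_feature_names(), vect.idf_)))
# e.g. 'apple' -> 1.4054651081081644, 'like' -> 2.0986122886681098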
