Understanding TfidfVectorizer output
Question
I'm testing TfidfVectorizer with a simple example, and I can't figure out the results.
corpus = ["I'd like an apple",
"An apple a day keeps the doctor away",
"Never compare an apple to an orange",
"I prefer scikit-learn to Orange",
"The scikit-learn docs are Orange and Blue"]
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)
print(vect.get_feature_names())
print(tfidf.shape)
print(tfidf)
Output:
['apple', 'away', 'blue', 'compare', 'day', 'docs', 'doctor', 'keeps', 'learn', 'like', 'orange', 'prefer', 'scikit']
(5, 13)
(0, 0) 0.5564505207186616
(0, 9) 0.830880748357988
...
I'm calculating the tf-idf of the first sentence manually and I'm getting different results:
- The first document ("I'd like an apple") contains just 2 words after removing stop words (according to the print of vect.get_feature_names()), so we are left with: "like", "apple"
- TF("apple", Document_1) = 1/2 = 0.5
- TF("like", Document_1) = 1/2 = 0.5
- The word "apple" appears 3 times in the corpus.
- The word "like" appears 1 time in the corpus.
- IDF("apple") = ln(5/3) = 0.51082
- IDF("like") = ln(5/1) = 1.60943

So:

tfidf("apple") in document1 = 0.5 * 0.51082 = 0.255 != 0.5564
tfidf("like") in document1 = 0.5 * 1.60943 = 0.804 != 0.8308
What am I missing?
Answer
There are several issues with your calculation.

First, there are multiple conventions on how to calculate TF (see the Wikipedia entry); scikit-learn does not normalize it with the document length. From the user guide:
[...] the term frequency, the number of times a term occurs in a given document [...]
So, here, TF("apple", Document_1) = 1, and not 0.5.
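A quick way to see this (a sketch, assuming scikit-learn is installed) is to look at the raw counts that feed into the tf-idf computation, which CountVectorizer exposes directly:

```python
# Sketch: the "TF" scikit-learn uses is the raw term count per document.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

count_vect = CountVectorizer(min_df=1, stop_words="english")
counts = count_vect.fit_transform(corpus)

# Raw count of "apple" in the first document - this is the TF used.
apple_idx = count_vect.vocabulary_["apple"]
print(counts[0, apple_idx])  # 1, not 0.5
```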
Second, regarding the IDF definition - from the docs:
If smooth_idf=True (the default), the constant "1" is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.
So, here we will have

IDF("apple") = ln((5+1)/(3+1)) + 1 = 1.4054651081081644

and hence

TF-IDF("apple") = 1 * 1.4054651081081644 = 1.4054651081081644
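This can be checked against the vectorizer itself (a sketch, assuming scikit-learn and NumPy are installed): the fitted IDF values are stored in the idf_ attribute of TfidfVectorizer.

```python
# Sketch: recompute the smoothed IDF for "apple" and compare it with the
# value TfidfVectorizer actually fitted (stored in idf_).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

vect = TfidfVectorizer(min_df=1, stop_words="english")
vect.fit(corpus)

n = len(corpus)                                   # 5 documents
df_apple = 3                                      # "apple" occurs in 3 of them
idf_apple = np.log((1 + n) / (1 + df_apple)) + 1  # smoothed IDF formula
print(idf_apple)                                  # 1.4054651081081644

apple_idx = vect.vocabulary_["apple"]
print(np.isclose(vect.idf_[apple_idx], idf_apple))  # True
```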
Third, with the default setting norm='l2', an extra normalization takes place; from the docs again:
Normalization is "c" (cosine) when norm='l2', "n" (none) when norm=None.
Explicitly removing this extra normalization from your example, i.e.

vect = TfidfVectorizer(min_df=1, stop_words="english", norm=None)

gives for 'apple':

(0, 0) 1.4054651081081644

i.e. the value computed manually above.
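Conversely, applying the l2 normalization by hand to the raw tf-idf values of the first document recovers the numbers from the original output (a sketch, assuming NumPy):

```python
# Sketch: l2-normalize the raw tf-idf values of document 1 by hand and
# recover the 0.5564... / 0.8308... figures from the question's output.
import numpy as np

idf_apple = np.log((1 + 5) / (1 + 3)) + 1  # "apple" is in 3 of 5 documents
idf_like = np.log((1 + 5) / (1 + 1)) + 1   # "like" is in 1 of 5 documents

raw = np.array([1 * idf_apple, 1 * idf_like])  # TF is the raw count, 1
normalized = raw / np.linalg.norm(raw)         # divide by the l2 norm

print(normalized)  # [0.55645052 0.83088075]
```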
For the details of how exactly the normalization affects the calculations when norm='l2' (the default setting), see the Tf-idf term weighting section of the user guide; by their own admission:
the tf-idfs computed in scikit-learn's TfidfTransformer and TfidfVectorizer differ slightly from the standard textbook notation