Why is the value of TF-IDF different from idf_?
Problem description
Why are the values in the vectorized corpus different from the values obtained through the idf_
attribute? Shouldn't the idf_
attribute return the inverse document frequency (IDF) the same way it appears in the vectorized corpus?
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]

vectorizer = TfidfVectorizer()
corpus = vectorizer.fit_transform(corpus)
print(corpus)
Corpus vectorized:
(0, 2) 0.6300993445179441
(0, 4) 0.44832087319911734
(0, 0) 0.44832087319911734
(0, 3) 0.44832087319911734
(1, 1) 0.6300993445179441
(1, 4) 0.44832087319911734
(1, 0) 0.44832087319911734
(1, 3) 0.44832087319911734
Vocabulary and idf_
values:
print(dict(zip(vectorizer.vocabulary_, vectorizer.idf_)))
Output:
{'this': 1.0,
'is': 1.4054651081081644,
'very': 1.4054651081081644,
'strange': 1.0,
'nice': 1.0}
Vocabulary indices:
print(vectorizer.vocabulary_)
Output:
{'this': 3,
'is': 0,
'very': 4,
'strange': 2,
'nice': 1}
Why is the value for the word this
0.44
in the vectorized corpus, but 1.0
when obtained through idf_
?
Answer
This is because of the l2
normalization that TfidfVectorizer()
applies by default: it computes tf * idf for each document and then scales each row to unit l2 norm, so the stored values differ from the raw idf values.
If you set the norm
parameter to None
, you will get the same values as idf_
.
>>> corpus = ["This is very strange", "This is very nice"]
>>> vectorizer = TfidfVectorizer(norm=None)
>>> print(vectorizer.fit_transform(corpus))
# output
(0, 2) 1.4054651081081644
(0, 4) 1.0
(0, 0) 1.0
(0, 3) 1.0
(1, 1) 1.4054651081081644
(1, 4) 1.0
(1, 0) 1.0
(1, 3) 1.0
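To see exactly where the normalized numbers come from, here is a small arithmetic check, a sketch assuming scikit-learn's default smooth idf formula, idf = ln((1 + n) / (1 + df)) + 1, and l2 row normalization:

```python
import math

# Document 0 is "This is very strange": 'strange' occurs in 1 of the 2
# documents, the other three terms occur in both (each with tf = 1).
idf_strange = math.log((1 + 2) / (1 + 1)) + 1  # smooth idf, ~1.4054651081081644
idf_common = math.log((1 + 2) / (1 + 2)) + 1   # = 1.0

# l2 norm of row 0: [idf_strange, idf_common, idf_common, idf_common]
norm = math.sqrt(idf_strange ** 2 + 3 * idf_common ** 2)

print(idf_strange / norm)  # ~0.6301, the value stored for 'strange'
print(idf_common / norm)   # ~0.4483, the value stored for 'this'
```

Dividing each raw idf value by the row's l2 norm reproduces the 0.6300993445179441 and 0.44832087319911734 shown in the vectorized output above.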
Also, your way of computing each feature's corresponding idf value is wrong: vocabulary_
maps each term to its column index, and iterating over its keys does not yield the terms in column order, so zipping it with idf_
pairs terms with the wrong values.
Use:
>>> print(dict(zip(vectorizer.get_feature_names(), vectorizer.idf_)))
{'is': 1.0,
'nice': 1.4054651081081644,
'strange': 1.4054651081081644,
'this': 1.0,
'very': 1.0}
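Alternatively, you can pair each term with its idf value through the column index stored in vocabulary_; this also sidesteps get_feature_names, which newer scikit-learn versions replace with get_feature_names_out. A minimal sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange", "This is very nice"]
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

# vocabulary_ maps term -> column index; idf_ is indexed by column,
# so look each term up by its own index instead of zipping key order.
idf_by_term = {term: vectorizer.idf_[idx]
               for term, idx in vectorizer.vocabulary_.items()}
print(idf_by_term)
```

This always yields the correct term-to-idf pairing regardless of the order in which terms were added to the vocabulary.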