tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer
Question
This page: http://scikit-learn.org/stable/modules/feature_extraction.html mentions:
"As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model."
I then followed the code on that page and called fit_transform() on my corpus. How can I get the weight of each feature computed by fit_transform()?
I tried:
In [39]: vectorizer.idf_
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-39-5475eefe04c0> in <module>()
----> 1 vectorizer.idf_
AttributeError: 'TfidfVectorizer' object has no attribute 'idf_'
But this attribute is missing.
Thanks.
Recommended answer
Since version 0.15, the learned IDF weight of each feature (the global, per-term part of its tf-idf score) can be retrieved via the idf_ attribute of the TfidfVectorizer object:
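from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)  # fitting learns the vocabulary and the IDF weights
idf = vectorizer.idf_  # one IDF weight per feature, in vocabulary order
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
terms = vectorizer.get_feature_names_out()
# Convert to plain str/float so the printed dict looks the same across NumPy versions
print({str(term): float(weight) for term, weight in zip(terms, idf)})
Output:
{'is': 1.0, 'nice': 1.4054651081081644, 'strange': 1.4054651081081644, 'this': 1.0, 'very': 1.0}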
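The values above can be reproduced by hand. With its default settings (smooth_idf=True), TfidfVectorizer computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The sketch below uses only the standard library; the lowercase-and-split tokenization is a simplifying assumption that happens to match the vectorizer's default tokenizer on this small corpus, and the names docs, doc_freq and manual_idf are illustrative only.

```python
import math

corpus = ["This is very strange",
          "This is very nice"]

# Naive tokenization (lowercase + whitespace split); an assumption that
# matches the default tokenizer only for simple text like this corpus.
docs = [set(text.lower().split()) for text in corpus]
n_docs = len(docs)

# Vocabulary in alphabetical order, as the vectorizer orders its features.
vocabulary = sorted(set().union(*docs))

# Document frequency: in how many documents each term occurs.
doc_freq = {term: sum(term in doc for doc in docs) for term in vocabulary}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
manual_idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
              for term in vocabulary}
print(manual_idf)
```

Terms that occur in every document ("is", "this", "very") get ln(3/3) + 1 = 1.0, while terms that occur in only one document ("nice", "strange") get ln(3/2) + 1, which is about 1.405.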
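As discussed in the comments, prior to version 0.15 a workaround is to access idf_ through the vectorizer's supposedly hidden _tfidf attribute (a TfidfTransformer instance):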
idf = vectorizer._tfidf.idf_  # private attribute, so it may change between releases
print(dict(zip(vectorizer.get_feature_names(), idf)))
This should give the same weights as above.
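Note that idf_ holds a single global weight per term. The per-document tf-idf weights that fit_transform() actually computes are stored in the sparse matrix it returns. A minimal sketch of reading them out follows, assuming the default settings (including L2 row normalization) and a corpus small enough to convert to a dense array; the doc_weights name is illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: one row per document

# get_feature_names_out() exists in scikit-learn >= 1.0;
# fall back to get_feature_names() on older releases.
try:
    terms = vectorizer.get_feature_names_out()
except AttributeError:
    terms = vectorizer.get_feature_names()

# For each document, map every term that occurs in it to its tf-idf weight.
doc_weights = []
for row in X.toarray():  # dense conversion is fine only for small corpora
    doc_weights.append({str(term): float(weight)
                        for term, weight in zip(terms, row) if weight > 0})

for text, weights in zip(corpus, doc_weights):
    print(text, "->", weights)
```

Within each document, the term unique to it ("strange" or "nice") receives a larger weight than the terms shared by both documents, and each row has unit Euclidean length because of the default norm="l2".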