Why is the value of TF-IDF different from idf_?

Question

Why do the values in the vectorized corpus differ from the values obtained through the idf_ attribute? Shouldn't the idf_ attribute return the inverse document frequency (IDF) exactly as it appears in the vectorized corpus?

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer()
corpus = vectorizer.fit_transform(corpus)

print(corpus)

Vectorized corpus:

  (0, 2)    0.6300993445179441
  (0, 4)    0.44832087319911734
  (0, 0)    0.44832087319911734
  (0, 3)    0.44832087319911734
  (1, 1)    0.6300993445179441
  (1, 4)    0.44832087319911734
  (1, 0)    0.44832087319911734
  (1, 3)    0.44832087319911734

Vocabulary and idf_ values:

print(dict(zip(vectorizer.vocabulary_, vectorizer.idf_)))

Output:

{'this': 1.0, 
 'is': 1.4054651081081644, 
 'very': 1.4054651081081644, 
 'strange': 1.0, 
 'nice': 1.0}

Vocabulary indices:

print(vectorizer.vocabulary_)

Output:

{'this': 3, 
 'is': 0, 
 'very': 4, 
 'strange': 2, 
 'nice': 1}

Why is the value for the word "this" 0.44 in the vectorized corpus but 1.0 when obtained through idf_?

Recommended answer

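This is because of L2 normalization, which TfidfVectorizer() applies by default: each document vector is scaled to unit Euclidean length, so the stored values are normalized TF-IDF weights rather than raw IDF values. If you set the norm parameter to None, you will get the same values as idf_ (every term occurs once per document here, so the term frequency is 1 and TF-IDF equals IDF).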


>>> vectorizer = TfidfVectorizer(norm=None)
>>> print(vectorizer.fit_transform(["This is very strange", "This is very nice"]))

# output

  (0, 2)    1.4054651081081644
  (0, 4)    1.0
  (0, 0)    1.0
  (0, 3)    1.0
  (1, 1)    1.4054651081081644
  (1, 4)    1.0
  (1, 0)    1.0
  (1, 3)    1.0
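
To see where 0.448 and 0.630 come from, the normalization can be reproduced by hand. The following is a minimal sketch in plain Python, not scikit-learn's actual implementation. It assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 that scikit-learn documents for the default smooth_idf=True; the names doc_freq and smoothed_idf are made up for illustration.

```python
import math

# Document frequencies for the two-sentence corpus in the question.
n_docs = 2
doc_freq = {"is": 2, "nice": 1, "strange": 1, "this": 2, "very": 2}

def smoothed_idf(df, n):
    # Assumed formula: smoothed IDF as documented for smooth_idf=True.
    return math.log((1 + n) / (1 + df)) + 1

idf = {term: smoothed_idf(df, n_docs) for term, df in doc_freq.items()}

# First document: "This is very strange". Every term occurs once, so tf = 1.
doc = ["this", "is", "very", "strange"]
raw = {term: 1 * idf[term] for term in doc}

# L2 normalization: divide each weight by the Euclidean length of the row.
length = math.sqrt(sum(w * w for w in raw.values()))
normalized = {term: w / length for term, w in raw.items()}

print(raw)         # unnormalized weights, as with norm=None
print(normalized)  # unit-length weights, as with the default norm='l2'
```

The raw weights correspond to the norm=None output, and dividing by the row length gives the values from the original vectorized corpus.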

Also, your way of pairing each feature with its idf value is wrong: idf_ is ordered by feature index, but iterating over the vocabulary_ dict does not yield the terms in feature-index order, so zip() matches terms with the wrong values.

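Use get_feature_names(), which returns the terms in feature-index order (in scikit-learn 1.0 and later the equivalent method is get_feature_names_out()):

 >>> print(dict(zip(vectorizer.get_feature_names(), vectorizer.idf_)))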

     {'is': 1.0,
      'nice': 1.4054651081081644, 
      'strange': 1.4054651081081644, 
      'this': 1.0, 
      'very': 1.0}
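
The mismatch can also be shown without scikit-learn. The sketch below copies the vocabulary_ mapping and the idf_ values from the outputs above into plain Python objects (the variable names are illustrative only) and compares pairing by dict iteration order with pairing by feature index.

```python
# Values copied from the question and answer above.
vocabulary = {"this": 3, "is": 0, "very": 4, "strange": 2, "nice": 1}
idf = [1.0, 1.4054651081081644, 1.4054651081081644, 1.0, 1.0]

# Pairing by dict iteration order: terms line up with the wrong positions.
wrong = dict(zip(vocabulary, idf))

# Pairing by feature index: look each term up at its own index in idf.
by_index = sorted(vocabulary.items(), key=lambda item: item[1])
right = {term: idf[index] for term, index in by_index}

print(wrong)
print(right)
```

Indexing idf with the value stored in the vocabulary mapping works regardless of the order in which the dict happens to iterate.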
