How to build a TFIDF Vectorizer given a corpus and compare its results using Sklearn?


Problem description

Sklearn makes a few tweaks in its implementation of the TFIDF vectorizer, so to replicate its exact results you need to add the following to your custom implementation of a tfidf vectorizer:

  1. Sklearn generates its vocabulary, and the matching idf values, sorted in alphabetical order.
  2. Sklearn's idf formula differs from the standard textbook formula. The constant "1" is added to the numerator and denominator of the idf, as if an extra document had been seen that contains every term in the collection exactly once, which prevents zero divisions: IDF(t) = 1 + ln((1 + total number of documents in the collection) / (1 + number of documents containing term t)).
  3. Sklearn applies L2-normalization to its output matrix.
  4. The final output of the sklearn tfidf vectorizer is a sparse matrix.
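As a minimal sketch of points 2 and 3 above, the smoothed idf and the L2 normalization can be written in plain Python. The helper names `smoothed_idf` and `l2_normalize` are made up for illustration and are not part of sklearn.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """idf(t) = 1 + ln((1 + N) / (1 + df(t))); hypothetical helper name."""
    return 1.0 + math.log((1 + n_docs) / (1 + doc_freq))

def l2_normalize(values):
    """Scale a list of numbers so that its Euclidean length is 1."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm else list(values)

# A term found in all 4 of 4 documents gets idf = 1 + ln(5/5) = 1.0
print(smoothed_idf(4, 4))
# A term found in only 2 of 4 documents gets a larger weight
print(smoothed_idf(4, 2))
# After normalization the squared entries sum to 1
print(l2_normalize([3.0, 4.0]))
```

Because of the leading `1 +`, a term that occurs in every document still keeps a weight of 1 instead of dropping to 0.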

Now, given the following corpus:

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

Sklearn implementation:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
skl_output = vectorizer.transform(corpus)
print(vectorizer.get_feature_names())  # on scikit-learn >= 1.2, use get_feature_names_out()
output : ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

print(skl_output[0])

Output:

(0, 8)    0.38408524091481483
(0, 6)    0.38408524091481483
(0, 3)    0.38408524091481483
(0, 2)    0.5802858236844359
(0, 1)    0.46979138557992045

I need to replicate the above result using a custom implementation, i.e., write the code in plain Python.

I wrote the following code:

from collections import Counter
from tqdm import tqdm
from scipy.sparse import csr_matrix
import math
import operator
from sklearn.preprocessing import normalize
import numpy

# The fit function helps in creating a vocabulary of all the unique words in the corpus

def fit(dataset):
  storage_set = set()
  if isinstance(dataset,list):
    for document in dataset:
      for word in document.split(" "):
        storage_set.add(word)
  storage_set = sorted(list(storage_set))
  vocab = {j:i for i,j in enumerate(storage_set)}
  return vocab

vocab =  fit(corpus)
print(vocab)
output : {'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4, 'second': 5, 'the': 6, 'third': 7, 'this': 8}
This output matches the sklearn output above.
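One caveat about this agreement: with its default settings, `TfidfVectorizer` lowercases the text and extracts tokens with the pattern `(?u)\b\w\w+\b` (runs of two or more word characters) rather than splitting on spaces. The two approaches agree on this corpus because it is lowercase, has no punctuation, and has no one-letter words. A rough sketch of that tokenization, where `sklearn_like_tokens` is a made-up helper name:

```python
import re

# Default token_pattern of TfidfVectorizer: tokens of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def sklearn_like_tokens(document):
    """Approximate the default preprocessing: lowercase, then regex tokenize."""
    return TOKEN_PATTERN.findall(document.lower())

# Same tokens as a plain split on this corpus
print(sklearn_like_tokens("this is the first document"))
# Punctuation, capitals and one-letter words make the two approaches differ
print(sklearn_like_tokens("This is a Test, right?"))
print("This is a Test, right?".split(" "))
```

On other corpora a vocabulary built with `split(" ")` would therefore not necessarily line up with sklearn's.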
# Returns a sparse matrix of all the non-zero values along with their row and column indices
def transform(dataset,vocab):
  row = []
  col = []
  values = []
  for ibx,document in enumerate(dataset):
    word_freq = dict(Counter(document.split()))
    for word, freq in word_freq.items():
      col_index = vocab.get(word,-1)
      if col_index != -1:
        if len(word)<2:
          continue
        col.append(col_index)
        row.append(ibx)
        td = freq/float(len(document)) # term frequency: word count divided by len(document), i.e. the character length of the string
        idf_ = 1+math.log((1+len(dataset))/float(1+idf(word)))
        values.append((td) * (idf_))
    return normalize(csr_matrix( ((values),(row,col)), shape=(len(dataset),len(vocab))),norm='l2' )

print(transform(corpus,vocab))

Output:

(0, 1)    0.3989610517704845
(0, 2)    0.602760579899478
(0, 3)    0.3989610517704845
(0, 6)    0.3989610517704845
(0, 8)    0.3989610517704845

As you can see, this output does not match the values in sklearn's output. I went through the logic several times and tried debugging everywhere, but I couldn't find why my custom implementation does not match sklearn's output. I would appreciate any insights.

Recommended answer

Can you please check idf() in idf_ = 1+math.log((1+len(dataset))/float(1+idf(word)))? When I tried to replicate your results, my output matched sklearn's without any significant change to your transform function. So I think the problem must be in your idf(), which must return the number of rows (documents) in the corpus in which the word w is present.
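To illustrate this, here is a minimal plain-Python sketch that assumes idf() is meant to return the document frequency. The asker's idf() is not shown in the question, so `doc_freq`, `total_occurrences` and `tfidf_row` are hypothetical names. The second counter is only a guess at what might have gone wrong: counting total occurrences instead of documents happens to give the mismatched numbers from the question.

```python
import math
from collections import Counter

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def doc_freq(word, dataset):
    """Number of documents that contain `word` at least once."""
    return sum(1 for document in dataset if word in document.split())

def total_occurrences(word, dataset):
    """Total count of `word` over the whole corpus (a guess at the bug)."""
    return sum(document.split().count(word) for document in dataset)

def tfidf_row(document, dataset, count_fn):
    """L2-normalized tf-idf weights of one document, keyed by word."""
    n_docs = len(dataset)
    weights = {}
    for word, freq in Counter(document.split()).items():
        idf_ = 1 + math.log((1 + n_docs) / (1 + count_fn(word, dataset)))
        weights[word] = freq * idf_
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}

# Document frequency: 'document' occurs in 3 of the 4 documents
print(tfidf_row(corpus[0], corpus, doc_freq))
# Total occurrences: 'document' is counted 4 times, so its idf drops to 1
print(tfidf_row(corpus[0], corpus, total_occurrences))
```

With `doc_freq`, the first row comes out as about 0.384 for 'this', 'is' and 'the', 0.580 for 'first' and 0.470 for 'document', which agrees with the sklearn output. With `total_occurrences`, it comes out as about 0.399 and 0.603, the values reported in the question. The sketch uses the raw count as the term frequency; dividing every count in a row by the same document length, as the question's code does, cancels out under L2 normalization.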
