How to build a TFIDF Vectorizer given a corpus and compare its results using Sklearn?

Views: 131 Posted: 2020/7/11 0:39:06 Tags: python scikit-learn tf-idf tfidfvectorizer


Problem description

Sklearn makes a few tweaks in its implementation of the TFIDF vectorizer, so to replicate its exact results you need to add the following to your custom TFIDF vectorizer implementation:

Sklearn's vocabulary, and therefore its idf vector, is sorted in alphabetical order.
Sklearn's idf formula differs from the standard textbook formula. The constant "1" is added to the numerator and denominator of the idf, as if an extra document had been seen containing every term in the collection exactly once, which prevents zero divisions: IDF(t) = 1 + log_e((1 + total number of documents in the collection) / (1 + number of documents containing term t)).
Sklearn applies L2-normalization to its output matrix.
The final output of the sklearn TFIDF vectorizer is a sparse matrix.

Now, given the following corpus:

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]
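
For reference, the smoothed idf formula listed above can be evaluated on this corpus with a few lines of plain Python. This is only a minimal sketch: the helper name `smoothed_idf` is hypothetical, and splitting on whitespace is an assumption that happens to be sufficient for this lowercase, punctuation-free corpus.

```python
import math

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def smoothed_idf(docs):
    """Return {term: idf} with idf(t) = 1 + ln((1 + N) / (1 + df(t)))."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # set() makes each document count a term at most once,
        # so 'document' in the second sentence is counted once, not twice
        for term in set(doc.split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # iterate in alphabetical order so the result follows the vocabulary order
    return {t: 1 + math.log((1 + n_docs) / (1 + df))
            for t, df in sorted(doc_freq.items())}

idf_values = smoothed_idf(corpus)
for term, value in idf_values.items():
    print(term, value)
```

Terms that occur in all four documents ('is', 'the', 'this') get an idf of exactly 1, because the logarithm of (1 + 4) / (1 + 4) is zero.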

Sklearn implementation:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
skl_output = vectorizer.transform(corpus)
print(vectorizer.get_feature_names())  # removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
Output: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

print(skl_output[0])

Output:

(0, 8)    0.38408524091481483
(0, 6)    0.38408524091481483
(0, 3)    0.38408524091481483
(0, 2)    0.5802858236844359
(0, 1)    0.46979138557992045
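
The values in this row can be checked by hand. The sketch below assumes raw counts as term frequency and hard-codes the document frequencies read off the corpus (the names `doc_freq`, `weights` and `row0` are illustrative only); it applies the smoothed idf and then divides by the row's L2 norm.

```python
import math

n_docs = 4
# number of documents containing each term of 'this is the first document'
doc_freq = {'this': 4, 'is': 4, 'the': 4, 'first': 2, 'document': 3}

# every term occurs once in this document, so tf = 1 and tf * idf = idf
weights = {t: 1 * (1 + math.log((1 + n_docs) / (1 + df)))
           for t, df in doc_freq.items()}

# L2 normalization: divide each weight by the Euclidean length of the row
l2_norm = math.sqrt(sum(w * w for w in weights.values()))
row0 = {t: w / l2_norm for t, w in weights.items()}

for term, value in row0.items():
    print(term, value)
```

Up to floating-point rounding, this gives about 0.3841 for 'this', 'is' and 'the', 0.5803 for 'first' and 0.4698 for 'document', in line with the sklearn row above.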

I need to replicate the above result using a custom implementation, i.e. write the code in plain Python.

I wrote the following code:

from collections import Counter
from tqdm import tqdm
from scipy.sparse import csr_matrix
import math
import operator
from sklearn.preprocessing import normalize
import numpy


# The fit function helps in creating a vocabulary of all the unique words in the corpus


def fit(dataset):
  storage_set = set()
  if isinstance(dataset,list):
    for document in dataset:
      for word in document.split(" "):
        storage_set.add(word)
  storage_set = sorted(list(storage_set))
  vocab = {j:i for i,j in enumerate(storage_set)}
  return vocab

vocab =  fit(corpus)
print(vocab)
Output: {'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4, 'second': 5, 'the': 6, 'third': 7, 'this': 8}
This output matches the sklearn output above.
# Returns a sparse matrix of all the non-zero values along with their row and column indices
def transform(dataset,vocab):
  row = []
  col = []
  values = []
  for ibx,document in enumerate(dataset):
    word_freq = dict(Counter(document.split()))
    for word, freq in word_freq.items():
      col_index = vocab.get(word,-1)
      if col_index != -1:
        if len(word)<2:
          continue
        col.append(col_index)
        row.append(ibx)
        td = freq/float(len(document)) # word count divided by len(document), which is the character length of the string
        idf_ = 1+math.log((1+len(dataset))/float(1+idf(word)))
        values.append((td) * (idf_))
    return normalize(csr_matrix( ((values),(row,col)), shape=(len(dataset),len(vocab))),norm='l2' )

print(transform(corpus,vocab))

Output:

(0, 1)    0.3989610517704845
(0, 2)    0.602760579899478
(0, 3)    0.3989610517704845
(0, 6)    0.3989610517704845
(0, 8)    0.3989610517704845

As you can see, this output does not match the values in sklearn's output. I went through the logic several times and tried debugging everywhere, but I could not find why my custom implementation does not match sklearn's output. I would appreciate any insights.
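
One way to narrow the gap is to compare against a small reference written only with the standard library. The following is a hedged sketch, not sklearn's actual code: `fit` and `transform` here are illustrative re-implementations, the rows are returned as plain dicts rather than a scipy sparse matrix, and whitespace splitting is assumed to be adequate for this corpus. The `idf` helper called in the code above is not shown, so it is not reproduced here. The printed values there would be consistent with a helper that counts every occurrence of a word (4 for 'document') instead of the number of documents containing it (3), but that is only a guess.

```python
import math
from collections import Counter

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def fit(docs):
    """Build an alphabetically sorted vocabulary and a smoothed idf per term."""
    doc_freq = Counter()
    for doc in docs:
        # set() so a term counts once per document, however often it repeats;
        # tokens shorter than two characters are skipped, as in the code above
        doc_freq.update({w for w in doc.split() if len(w) >= 2})
    n_docs = len(docs)
    vocab = {term: idx for idx, term in enumerate(sorted(doc_freq))}
    idf = {term: 1 + math.log((1 + n_docs) / (1 + df))
           for term, df in doc_freq.items()}
    return vocab, idf

def transform(docs, vocab, idf):
    """Return one {column index: weight} dict per document, L2-normalized per row."""
    rows = []
    for doc in docs:
        counts = Counter(w for w in doc.split() if w in vocab)
        # raw count times idf; dividing by document length would cancel
        # out under the row-wise L2 normalization anyway
        weights = {vocab[w]: c * idf[w] for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        rows.append({col: v / norm for col, v in sorted(weights.items())} if norm else {})
    return rows  # returned after the loop, so every document gets a row

vocab, idf = fit(corpus)
rows = transform(corpus, vocab, idf)
for col, value in rows[0].items():
    print((0, col), value)
```

The design choice that matters most is computing document frequency per document (via `set`) rather than per occurrence. With that, the first row agrees with the sklearn values quoted earlier to within floating-point rounding.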
