How to manually calculate TF-IDF score from SKLearn's TfidfVectorizer


Problem Description

I have been running the TF-IDF Vectorizer from SKLearn but am having trouble recreating the values manually (as an aid to understanding what is happening).

To add some context, I have a list of documents that I have extracted named entities from (in my actual data these go up to 5-grams but here I have restricted this to bigrams). I only want to know the TF-IDF scores for these values and thought passing these terms via the vocabulary parameter would do this.

Here is some dummy data similar to what I am working with:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd    


# list of named entities I want to generate TF-IDF scores for
named_ents = ['boston','america','france','paris','san francisco']

# my list of documents
docs = ['i have never been to boston',
    'boston is in america',
    'paris is the capitol city of france',
    'this sentence has no named entities included',
    'i have been to san francisco and paris']

# find the max nGram in the named entity vocabulary
ne_vocab_split = [len(word.split()) for word in named_ents]
max_ngram = max(ne_vocab_split)

tfidf = TfidfVectorizer(vocabulary = named_ents, stop_words = None, ngram_range=(1,max_ngram))
tfidf_vector = tfidf.fit_transform(docs)

output = pd.DataFrame(tfidf_vector.T.todense(), index=named_ents, columns=docs)
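
For reference, the single score I'm querying below can be read straight off this frame:

output.loc['boston', 'i have never been to boston']
# comes out as 1.0, which is the value I question below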

Note: I know stop-words are removed by default, but some of the named entities in my actual data-set include phrases such as 'the state department'. So they have been kept here.

Here is where I need some help. I'm of the understanding that we calculate the TF-IDF as follows:

TF: term frequency, which according to SKLearn's guidelines is "the number of times a term occurs in a given document"

IDF: inverse document frequency: the natural log of the ratio of 1 + the number of documents to 1 + the number of documents containing the term. According to the same guidelines, the resulting value has 1 added to it so that terms occurring in every document are not entirely ignored (the +1s inside the ratio are what prevent division by zero).

We then multiply the TF by the IDF to give the overall TF-IDF for a given term, in a given document.
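
Written out as code, my understanding amounts to something like this (tfidf_score is just an illustrative helper, not part of sklearn):

import numpy as np

def tfidf_score(tf, n_docs, df):
    # tf-idf as described above: tf * (ln((1+n)/(1+df)) + 1)
    idf = np.log((1 + n_docs) / (1 + df)) + 1
    return tf * idf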

Example

Let's take the first column as an example, which has only one named entity 'Boston', and according to the above code has a TF-IDF on the first document of 1. However, when I work this out manually I get the following:

TF = 1

IDF = ln((1 + total docs) / (1 + docs with 'boston')) + 1
    = ln((1 + 5) / (1 + 2)) + 1
    = ln(6 / 3) + 1
    = ln(2) + 1
    = 0.69314 + 1
    = 1.69314

TF-IDF = 1 * 1.69314 = 1.69314 (not 1)
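
The arithmetic itself checks out in numpy:

import numpy as np
np.log(2) + 1
# 1.6931471805599454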

Perhaps I'm missing something in the documentation that says scores are capped at 1, but I cannot work out where I've gone wrong. Furthermore, with the above calculations, there shouldn't be any difference between the score for Boston in the first column and the second column, as the term only appears once in each document.

Edit

After posting the question I thought that maybe the term frequency was calculated as a ratio with either the number of unigrams in the document, or the number of named entities in the document. For example, in the second document SKLearn generates a score for 'boston' of 0.627914. If I calculate the TF as a ratio of 'boston' tokens (1) : all unigram tokens (4) I get a TF of 0.25, which when applied to the TF-IDF returns a score just over 0.147.

Similarly, when I use a ratio of 'boston' tokens (1) : all NE tokens (2) and apply the TF-IDF I get a score of 0.846. So clearly I am going wrong somewhere.

Answer

Let's do this mathematical exercise one step at a time.

Step 1. Get tfidf scores for the boston token

docs = ['i have never been to boston',
        'boston is in america',
        'paris is the capitol city of france',
        'this sentence has no named entities included',
        'i have been to san francisco and paris']

from sklearn.feature_extraction.text import TfidfVectorizer

# I did not include your named_ents here; instead the full vocab is learned from docs
tfidf = TfidfVectorizer(smooth_idf=True,norm='l1')

Note the params in TfidfVectorizer; they are important for the smoothing and normalization later.

docs_tfidf = tfidf.fit_transform(docs).todense()
n = tfidf.vocabulary_["boston"]  # column index of the 'boston' feature
docs_tfidf[:,n]
matrix([[0.19085885],
        [0.22326669],
        [0.        ],
        [0.        ],
        [0.        ]])

What we've got so far: tfidf scores for the boston token (#3 in the vocab).
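
If you want to confirm that column index, the learned vocabulary can be listed; in recent sklearn versions this is get_feature_names_out (older versions use get_feature_names):

tfidf.get_feature_names_out()
# features come back in alphabetical order, with 'boston' at index 3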

Step 2. Calculate tfidf for the boston token w/o norm.

The formula is:

tf-idf(t, d) = tf(t, d) * idf(t)
idf(t) = log( (n+1) / (df(t)+1) ) + 1

where:
- tf(t, d) -- simple frequency of term t in document d
- idf(t) -- smoothed inverse document frequency (because of the smooth_idf=True param)
- n -- the total number of documents

Counting the boston token in the 0th document and the number of documents it appears in:

import numpy as np

# tf: 1 occurrence of 'boston' among the 5 tokens of doc 0
# idf: log((1 + 5 docs) / (1 + 2 docs containing 'boston')) + 1
tfidf_boston_wo_norm = ((1/5) * (np.log((1+5)/(1+2))+1))
tfidf_boston_wo_norm
0.3386294361119891

Note, i does not count as a token according to the built-in tokenization scheme (the default token pattern only keeps tokens of two or more characters), so document 0 contributes 5 tokens. Also note that sklearn itself uses the raw count as tf rather than count/5; since the 1/5 factor is common to every term in the row, it cancels out under normalization, so the final scores come out the same either way.
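
The vectorizer's own analyzer (build_analyzer) shows exactly which tokens survive:

tfidf.build_analyzer()(docs[0])
# expected: ['have', 'never', 'been', 'to', 'boston'] -- 'i' is dropped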

Step 3. Normalization

Let's do l1 normalization first, i.e. all the calculated non-normalized tfidf values in a row should sum up to 1:

# five terms, one per token in doc 0; only 'never' has df=1,
# the other tokens ('have', 'been', 'to', 'boston') each have df=2
l1_norm = ((1/5) * (np.log((1+5)/(1+2))+1) +
           (1/5) * (np.log((1+5)/(1+1))+1) +
           (1/5) * (np.log((1+5)/(1+2))+1) +
           (1/5) * (np.log((1+5)/(1+2))+1) +
           (1/5) * (np.log((1+5)/(1+2))+1))
tfidf_boston_w_l1_norm = tfidf_boston_wo_norm/l1_norm
tfidf_boston_w_l1_norm 
0.19085884520912985

As you see, we are getting the same tfidf score as above.
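
As an extra sanity check (my addition): since all entries are non-negative, every row of the l1-normalized matrix should sum to 1:

docs_tfidf[0].sum()
# should be ~1.0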

Let's now do the same math for l2 norm.

Baseline:

# sublinear_tf has no effect here since every raw count is 1 (1 + log(1) = 1);
# the relevant change from step 1 is norm='l2'
tfidf = TfidfVectorizer(sublinear_tf=True,norm='l2')
docs_tfidf = tfidf.fit_transform(docs).todense()
docs_tfidf[:,n]
matrix([[0.42500138],
        [0.44400208],
        [0.        ],
        [0.        ],
        [0.        ]])

The calculation:

# same five terms as in the l1 case, squared and summed under a square root
l2_norm = np.sqrt(((1/5) * (np.log((1+5)/(1+2))+1))**2 +
                  ((1/5) * (np.log((1+5)/(1+1))+1))**2 +
                  ((1/5) * (np.log((1+5)/(1+2))+1))**2 +
                  ((1/5) * (np.log((1+5)/(1+2))+1))**2 +
                  ((1/5) * (np.log((1+5)/(1+2))+1))**2
                 )

tfidf_boston_w_l2_norm = tfidf_boston_wo_norm/l2_norm
tfidf_boston_w_l2_norm 
0.42500137513291814

It's still the same, as you may see.
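
A matching sanity check: every row of the l2-normalized matrix should have unit Euclidean length:

np.linalg.norm(docs_tfidf[0])
# should be ~1.0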
