How is the Tf-Idf value calculated with analyzer='char'?


Question

I'm having a problem understanding how we get the Tf-Idf values in the following program:

I have tried calculating the value of 'a' in document 2 ('And_this_is_the_third_one.') using the concept given on the site, but my value for 'a' using that concept is

1/26 * log(4/1)

((count of occurrences of the character 'a') / (number of characters in the given document)) * log(# docs / # docs in which the given character occurs)

= 0.023156

But the output is returned as 0.2203, as can be seen in the output.
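For reference, the asker's 0.023156 only comes out if the logarithm is taken base 10 (the question does not say which base is used; base 10 is an assumption implied by the arithmetic). A minimal sketch reproducing it:

```python
import math

# Document 2 is 'And_this_is_the_third_one.' -> 26 characters, one 'a'
tf = 1 / 26                 # count of 'a' / total characters in the document
idf = math.log10(4 / 1)     # 4 docs total, 'a' occurs in 1 (base-10 log assumed)
print(round(tf * idf, 6))   # 0.023156
```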

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['This_is_the_first_document.', 'This_document_is_the_second_document.', 'And_this_is_the_third_one.', 'Is_this_the_first_document?', ]
vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char")
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(vectorizer.vocabulary_)
m = X.todense()
print(m)

Using the concept explained above, I expected the output to be 0.023156.

The output is:

['.', '?', '_', 'a', 'c', 'd', 'e', 'f', 'h', 'i', 'm', 'n', 'o', 'r', 's', 't', 'u']
{'t': 15, 'h': 8, 'i': 9, 's': 14, '_': 2, 'e': 6, 'f': 7, 'r': 13, 'd': 5, 'o': 12, 'c': 4, 'u': 16, 'm': 10, 'n': 11, '.': 0, 'a': 3, '?': 1}
[[0.14540332 0.         0.47550697 0.         0.14540332 0.11887674
  0.23775349 0.17960203 0.23775349 0.35663023 0.14540332 0.11887674
  0.11887674 0.14540332 0.35663023 0.47550697 0.14540332]
 [0.10814145 0.         0.44206359 0.         0.32442434 0.26523816
  0.35365088 0.         0.17682544 0.17682544 0.21628289 0.26523816
  0.26523816 0.         0.26523816 0.35365088 0.21628289]
 [0.14061506 0.         0.57481012 0.22030066 0.         0.22992405
  0.22992405 0.         0.34488607 0.34488607 0.         0.22992405
  0.11496202 0.14061506 0.22992405 0.34488607 0.        ]
 [0.         0.2243785  0.46836004 0.         0.14321789 0.11709001
  0.23418002 0.17690259 0.23418002 0.35127003 0.14321789 0.11709001
  0.11709001 0.14321789 0.35127003 0.46836004 0.14321789]]

Answer

TfidfVectorizer() adds smoothing to the document counts and applies l2 normalization on top of the tf-idf vectors, as mentioned in the documentation.

(count of occurrences of the character / number of characters in the given document) * (log((1 + # docs) / (1 + # docs in which the given character is present)) + 1)

This normalization is l2 by default, but you can change or remove this step using the norm parameter. Similarly, smoothing can be turned off by setting smooth_idf=False.
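As a sketch of those two parameters (smooth_idf and norm are documented TfidfVectorizer arguments; the corpus is the one from the question). Note that with norm=None, sklearn's tf is the raw character count, not a frequency, so the entry is count * idf:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This_is_the_first_document.',
          'This_document_is_the_second_document.',
          'And_this_is_the_third_one.',
          'Is_this_the_first_document?']

# Defaults: smooth_idf=True, norm='l2' (the behaviour seen in the question)
default_vec = TfidfVectorizer(analyzer='char')

# No smoothing, no normalization: each entry is raw_count * (log(n/df) + 1)
raw_vec = TfidfVectorizer(analyzer='char', smooth_idf=False, norm=None)
X_raw = raw_vec.fit_transform(corpus)

# 'a' appears once, in one of the four documents
a_col = raw_vec.vocabulary_['a']
print(X_raw[2, a_col])  # 1 * (log(4/1) + 1) ≈ 2.3863
```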

To understand how the exact score is computed, I am going to fit a CountVectorizer() to get the counts of each character in every document.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

countVectorizer = CountVectorizer(analyzer='char')
tf = countVectorizer.fit_transform(corpus)
tf_df = pd.DataFrame(tf.toarray(),
                     columns=countVectorizer.get_feature_names())
tf_df

#output:
   .  ?  _  a  c  d  e  f  h  i  m  n  o  r  s  t  u
0  1  0  4  0  1  1  2  1  2  3  1  1  1  1  3  4  1
1  1  0  5  0  3  3  4  0  2  2  2  3  3  0  3  4  2
2  1  0  5  1  0  2  2  0  3  3  0  2  1  1  2  3  0
3  0  1  4  0  1  1  2  1  2  3  1  1  1  1  3  4  1

Now let us apply the tf-idf weighting, based on the sklearn implementation, to the document at index 2 ('And_this_is_the_third_one.')!

import numpy as np

v = []
doc_id = 2
# number of documents in the corpus + smoothing
n_d = 1 + tf_df.shape[0]

for char in tf_df.columns:
    # tf: count of this char in the doc / total number of chars in the doc
    tf = tf_df.loc[doc_id, char] / tf_df.loc[doc_id, :].sum()

    # number of documents containing this char, with smoothing
    df_d_t = 1 + sum(tf_df.loc[:, char] > 0)
    # idf with smoothing
    idf = np.log(n_d / df_d_t) + 1

    # tf-idf score for this char
    v.append(tf * idf)

from sklearn.preprocessing import normalize

# normalize the vector with l2 norm and create a dataframe with feature_names

pd.DataFrame(normalize([v], norm='l2'), columns=vectorizer.get_feature_names())

#output:

       .    ?        _         a    c         d         e    f         h        i    m         n         o         r         s         t    u  
 0.140615  0.0  0.57481  0.220301  0.0  0.229924  0.229924  0.0  0.344886   0.344886  0.0  0.229924  0.114962  0.140615  0.229924  0.344886  0.0 

You can see that the score for the char 'a' matches the TfidfVectorizer() output!
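The match can also be checked end-to-end; this sketch refits the vectorizer from the question and reads off the score for 'a' in document 2 directly (the column index is taken from the fitted vocabulary):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This_is_the_first_document.',
          'This_document_is_the_second_document.',
          'And_this_is_the_third_one.',
          'Is_this_the_first_document?']

vectorizer = TfidfVectorizer(min_df=0.0, analyzer='char')
X = vectorizer.fit_transform(corpus)

row2 = np.asarray(X.todense())[2]
a_score = row2[vectorizer.vocabulary_['a']]
print(round(a_score, 6))  # ≈ 0.220301
```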
