如何使用Analyzer ='char'计算Tf-Idf值? [英] How is the Tf-Idf value calculated with analyzer ='char'?

查看:105
本文介绍了如何使用Analyzer ='char'计算Tf-Idf值?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在了解如何通过以下程序获取Tf-Idf时遇到问题:

I'm having a problem in understanding how we got the Tf-Idf in the following program:

我尝试使用

I have tried calculating the value of a in the document 2 ('And_this_is_the_third_one.') using the concept given on the site, but my value of 'a' using the above concept is

1/26 * log(4/1)

1/26*log(4/1)

((('a'字符出现的次数)/(给定字符数 document)* log(#Docs/#包含给定字符的Docs 发生))

((count of occurrence of 'a' character)/(no of characters in the given document)*log( # Docs/ # Docs in which given character occurred))

= 0.023156

= 0.023156

但是从输出中可以看到,输出返回为0.2203.

But output is returned as 0.2203 as can be seen in the output.

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['This_is_the_first_document.', 'This_document_is_the_second_document.', 'And_this_is_the_third_one.', 'Is_this_the_first_document?', ]
vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char")
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(vectorizer.vocabulary_)
m = X.todense()
print(m)

使用上述概念,我希望输出为0.023156.

I expected the output to be 0.023156 using the concept explained above.

输出为:

['.', '?', '_', 'a', 'c', 'd', 'e', 'f', 'h', 'i', 'm', 'n', 'o', 'r', 's', 't', 'u']


{'t': 15, 'h': 8, 'i': 9, 's': 14, '_': 2, 'e': 6, 'f': 7, 'r': 13, 'd': 5, 'o': 12, 'c': 4, 'u': 16, 'm': 10, 'n': 11, '.': 0, 'a': 3, '?': 1}


[[0.14540332 0.         0.47550697 0.         0.14540332 0.11887674
  0.23775349 0.17960203 0.23775349 0.35663023 0.14540332 0.11887674
  0.11887674 0.14540332 0.35663023 0.47550697 0.14540332]


 [0.10814145 0.         0.44206359 0.         0.32442434 0.26523816
  0.35365088 0.         0.17682544 0.17682544 0.21628289 0.26523816
  0.26523816 0.         0.26523816 0.35365088 0.21628289]


 [0.14061506 0.         0.57481012 0.22030066 0.         0.22992405
  0.22992405 0.         0.34488607 0.34488607 0.         0.22992405
  0.11496202 0.14061506 0.22992405 0.34488607 0.        ]


 [0.         0.2243785  0.46836004 0.         0.14321789 0.11709001
  0.23418002 0.17690259 0.23418002 0.35127003 0.14321789 0.11709001
  0.11709001 0.14321789 0.35127003 0.46836004 0.14321789]]

推荐答案

TfidfVectorizer()已将平滑处理添加到文档计数中,并且l2归一化已应用到顶部tf-idf向量上,如文档.

The TfidfVectorizer() has smoothing added to the document counts and l2 normalization been applied on top tf-idf vector, as mentioned in the documentation.

(字符出现的次数)/(给定字符数) 文档)*
日志(1 +#文档/1 +#文档中存在给定字符)+1)

(count of occurrence of the character)/(no of characters in the given document) *
log (1 + # Docs / 1 + # Docs in which the given character is present) +1 )

此归一化默认为l2,但是您可以使用参数norm更改或删除此步骤.同样,平滑可以是

This Normalization is l2 by default, but you can change or remove this step by using the parameter norm. Similarly, smoothing can be

要了解如何计算准确分数,我将使用CountVectorizer()来了解每个文档中每个字符的计数.

To understand how does the exact score is computed, I am going to fit a CountVectorizer() to know the counts of each character in every document.

countVectorizer = CountVectorizer(analyzer='char')
tf = countVectorizer.fit_transform(corpus)
tf_df = pd.DataFrame(tf.toarray(),
                     columns= countVectorizer.get_feature_names())
tf_df

#output:
   .  ?  _  a  c  d  e  f  h  i  m  n  o  r  s  t  u
0  1  0  4  0  1  1  2  1  2  3  1  1  1  1  3  4  1
1  1  0  5  0  3  3  4  0  2  2  2  3  3  0  3  4  2
2  1  0  5  1  0  2  2  0  3  3  0  2  1  1  2  3  0
3  0  1  4  0  1  1  2  1  2  3  1  1  1  1  3  4  1

现在让我们现在基于第二个文档应用基于sklearn实现的tf-idf加权!

Let us apply the tf-idf weighting based on sklearn implementation now for the second document now!

v=[]
doc_id = 2
# number of documents in the corpus + smoothing
n_d = 1+ tf_df.shape[0]

for char in tf_df.columns:
    # calculate tf - count of this char in the doc / total number chars in the doc
    tf = tf_df.loc[doc_id,char]/tf_df.loc[doc_id,:].sum()

    # number of documents containing this char with smoothing 
    df_d_t = 1+ sum(tf_df.loc[:,char]>0)
    # now calculate the idf with smoothing 
    idf = (np.log (n_d/df_d_t) + 1 )

    # calculate the score now
    v.append (tf*idf)

from sklearn.preprocessing import normalize

# normalize the vector with l2 norm and create a dataframe with feature_names

pd.DataFrame(normalize([v], norm='l2'), columns=vectorizer.get_feature_names())

#output:

       .    ?        _         a    c         d         e    f         h        i    m         n         o         r         s         t    u  
 0.140615  0.0  0.57481  0.220301  0.0  0.229924  0.229924  0.0  0.344886   0.344886  0.0  0.229924  0.114962  0.140615  0.229924  0.344886  0.0 

您会发现char a的分数与TfidfVectorizer()输出相匹配!

you could find that the score for char a matches with the TfidfVectorizer() output!!!

这篇关于如何使用Analyzer ='char'计算Tf-Idf值?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆