How is the Tf-Idf value calculated with analyzer='char'?
Problem description
I'm having trouble understanding how the Tf-Idf values are obtained in the program below.
I tried calculating the value of 'a' in document 2 ('And_this_is_the_third_one.') using the concept given on the site, but the value I get for 'a' is
1/26*log(4/1)
((count of occurrences of the 'a' character)/(no. of characters in the given document) * log(# Docs / # Docs in which the given character occurs))
= 0.023156
But the returned value is 0.2203, as can be seen in the output.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This_is_the_first_document.',
    'This_document_is_the_second_document.',
    'And_this_is_the_third_one.',
    'Is_this_the_first_document?',
]
# analyzer="char" makes every single character a feature
vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char")
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
print(vectorizer.vocabulary_)
m = X.todense()
print(m)
I expected the output to be 0.023156 using the concept explained above.
The output is:
['.', '?', '_', 'a', 'c', 'd', 'e', 'f', 'h', 'i', 'm', 'n', 'o', 'r', 's', 't', 'u']
{'t': 15, 'h': 8, 'i': 9, 's': 14, '_': 2, 'e': 6, 'f': 7, 'r': 13, 'd': 5, 'o': 12, 'c': 4, 'u': 16, 'm': 10, 'n': 11, '.': 0, 'a': 3, '?': 1}
[[0.14540332 0. 0.47550697 0. 0.14540332 0.11887674
0.23775349 0.17960203 0.23775349 0.35663023 0.14540332 0.11887674
0.11887674 0.14540332 0.35663023 0.47550697 0.14540332]
[0.10814145 0. 0.44206359 0. 0.32442434 0.26523816
0.35365088 0. 0.17682544 0.17682544 0.21628289 0.26523816
0.26523816 0. 0.26523816 0.35365088 0.21628289]
[0.14061506 0. 0.57481012 0.22030066 0. 0.22992405
0.22992405 0. 0.34488607 0.34488607 0. 0.22992405
0.11496202 0.14061506 0.22992405 0.34488607 0. ]
[0. 0.2243785 0.46836004 0. 0.14321789 0.11709001
0.23418002 0.17690259 0.23418002 0.35127003 0.14321789 0.11709001
0.11709001 0.14321789 0.35127003 0.46836004 0.14321789]]
Recommended answer
TfidfVectorizer() adds smoothing to the document counts, and l2 normalization is applied on top of the tf-idf vector, as mentioned in the documentation.
(count of occurrences of the character)/(no. of characters in the given document) *
(log((1 + # Docs) / (1 + # Docs in which the given character is present)) + 1)
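Written out with explicit grouping, the smoothed weighting above reads as follows. The symbols n (number of documents), df(t) (number of documents containing character t) and tf(t, d) are notation introduced here for illustration, and the logarithm is the natural log. The second line plugs in the numbers for 'a' in document 2, before any normalization.

```latex
\[
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
\]
\[
\mathrm{idf}(\text{a}) = \ln\frac{1 + 4}{1 + 1} + 1 \approx 1.9163,
\qquad
\text{tf-idf}(\text{a}, d_2) = \frac{1}{26}\cdot 1.9163 \approx 0.0737
\]
```

The value 0.2203 in the output only appears after the whole document vector is divided by its l2 norm.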
This normalization is l2 by default, but you can change or remove this step by using the parameter norm. Similarly, smoothing can be turned off with the parameter smooth_idf.
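As a rough illustration of how these two options change the number, the sketch below refits the vectorizer on the corpus from the question with normalization removed (norm=None) and then with smoothing also disabled (smooth_idf=False). The helper score_of_a is a hypothetical name used only for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This_is_the_first_document.',
    'This_document_is_the_second_document.',
    'And_this_is_the_third_one.',
    'Is_this_the_first_document?',
]

def score_of_a(**kwargs):
    """Weight of 'a' in document 2 (hypothetical helper for this example)."""
    vec = TfidfVectorizer(analyzer="char", **kwargs)
    X = vec.fit_transform(corpus)
    # vocabulary_ maps each character to its column index
    return float(X[2, vec.vocabulary_["a"]])

default_score = score_of_a()                          # smoothing + l2 norm
unnormalized = score_of_a(norm=None)                  # 1 * (ln(5/2) + 1)
unsmoothed = score_of_a(norm=None, smooth_idf=False)  # 1 * (ln(4/1) + 1)

print(default_score, unnormalized, unsmoothed)
```

Note that with norm=None the term frequency is the raw count (1 for 'a'), not the count divided by the document length.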
To understand how the exact score is computed, I am going to fit a CountVectorizer() to get the count of each character in every document.
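from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# count how often each character occurs in each document
countVectorizer = CountVectorizer(analyzer='char')
tf = countVectorizer.fit_transform(corpus)
tf_df = pd.DataFrame(tf.toarray(),
                     columns=countVectorizer.get_feature_names_out())
tf_df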
# output:
   .  ?  _  a  c  d  e  f  h  i  m  n  o  r  s  t  u
0  1  0  4  0  1  1  2  1  2  3  1  1  1  1  3  4  1
1  1  0  5  0  3  3  4  0  2  2  2  3  3  0  3  4  2
2  1  0  5  1  0  2  2  0  3  3  0  2  1  1  2  3  0
3  0  1  4  0  1  1  2  1  2  3  1  1  1  1  3  4  1
Now let us apply the tf-idf weighting, following the sklearn implementation, to document 2 (index 2, i.e. 'And_this_is_the_third_one.')!
import numpy as np
from sklearn.preprocessing import normalize

v = []
doc_id = 2
# number of documents in the corpus + smoothing
n_d = 1 + tf_df.shape[0]
for char in tf_df.columns:
    # tf: count of this char in the doc / total number of chars in the doc
    tf = tf_df.loc[doc_id, char] / tf_df.loc[doc_id, :].sum()
    # number of documents containing this char, with smoothing
    df_d_t = 1 + sum(tf_df.loc[:, char] > 0)
    # idf with smoothing
    idf = np.log(n_d / df_d_t) + 1
    # tf-idf score for this char
    v.append(tf * idf)

# normalize the vector with the l2 norm and build a DataFrame with the feature names
pd.DataFrame(normalize([v], norm='l2'), columns=vectorizer.get_feature_names_out())
#output:
. ? _ a c d e f h i m n o r s t u
0.140615 0.0 0.57481 0.220301 0.0 0.229924 0.229924 0.0 0.344886 0.344886 0.0 0.229924 0.114962 0.140615 0.229924 0.344886 0.0
You can see that the score for char a matches the TfidfVectorizer() output!
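The same check can be extended to every character in every document. The following is a minimal vectorized sketch, not the library's own code: it rebuilds the matrix from raw counts with NumPy and compares it with what TfidfVectorizer returns. It uses raw counts as the term frequency, since dividing by the document length is cancelled by the l2 normalization anyway. The names counts, manual and expected are chosen for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This_is_the_first_document.',
    'This_document_is_the_second_document.',
    'And_this_is_the_third_one.',
    'Is_this_the_first_document?',
]

# raw character counts, one row per document
counts = CountVectorizer(analyzer='char').fit_transform(corpus).toarray()

n_docs = counts.shape[0]
# number of documents in which each character occurs
df = (counts > 0).sum(axis=0)
# smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + df)) + 1

# weight every count by its idf, then scale each row to unit l2 norm
weights = counts * idf
manual = weights / np.linalg.norm(weights, axis=1, keepdims=True)

expected = TfidfVectorizer(analyzer='char').fit_transform(corpus).toarray()
print(np.allclose(manual, expected))
```

Both vectorizers sort their vocabulary the same way, so the columns line up without any reordering.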