TFIDF calculating confusion
Question
I found the following code on the internet for calculating TFIDF:
https://github.com/timtrueman/tf-idf/blob/master/tf-idf.py
I added "1 +" to the denominator in the function def idf(word, documentList) so that I won't get a division-by-zero error:
return math.log(len(documentList) / (1 + float(numDocsContaining(word,documentList))))
But I am confused about two things:
- In some cases I get negative values. Is that correct?
- I am confused by lines 62, 63 and 64.
Code:
documentNumber = 0
for word in documentList[documentNumber].split(None):
    words[word] = tfidf(word,documentList[documentNumber],documentList)
Should TFIDF be calculated on the first document only?
Recommended answer
- No. Tf-idf is tf, a non-negative value, times idf, a non-negative value, so it can never be negative. This code seems to implement the erroneous definition of tf-idf that was on Wikipedia for years (it has since been fixed).
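One likely source of the negative values in the question is the added "1 +" itself: when a word occurs in every document, the ratio becomes N / (N + 1), and its logarithm is below zero. The sketch below is only an illustration of that effect, not the code from the linked repository. The function names and the sample documents are made up for this example, and the smoothed formula log((1 + N) / (1 + df)) + 1 is one common alternative (scikit-learn documents it for its smooth_idf option) rather than the one correct definition.

```python
import math


def num_docs_containing(word, document_list):
    """Count documents whose whitespace-split tokens include `word`."""
    return sum(1 for doc in document_list if word in doc.split())


def idf_plus_one_in_denominator(word, document_list):
    # Variant from the question: log(N / (1 + df)).
    # Goes negative as soon as df == N, i.e. the word is in every document.
    df = num_docs_containing(word, document_list)
    return math.log(len(document_list) / (1 + df))


def idf_smoothed(word, document_list):
    # Assumed alternative: log((1 + N) / (1 + df)) + 1.
    # Since df <= N, the ratio is at least 1, so the result is at least 1.
    n = len(document_list)
    df = num_docs_containing(word, document_list)
    return math.log((1 + n) / (1 + df)) + 1


# Hypothetical toy corpus: "the" appears in all three documents.
docs = ["the cat sat", "the dog ran", "the bird flew"]

negative_idf = idf_plus_one_in_denominator("the", docs)  # log(3 / 4), below zero
smoothed_idf = idf_smoothed("the", docs)                 # log(4 / 4) + 1
unseen_idf = idf_smoothed("fish", docs)                  # df == 0, still defined

print(negative_idf, smoothed_idf, unseen_idf)
```

Both variants avoid the division by zero for unseen words, but only the smoothed one keeps idf, and therefore tf-idf, non-negative.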