Effects of stemming on the term frequency?

Question

How are term frequency (TF) and inverse document frequency (IDF) affected by stop-word removal and stemming?

Thanks!

Recommended answer

tf is term frequency: the number of times a term occurs in a document.
idf is inverse document frequency, which is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.
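
As a rough illustration of those two definitions, here is a minimal Python sketch. The function names and the tiny corpus are made up for this example, and it assumes raw counts for tf and the natural logarithm for idf (real libraries often add smoothing or use a different log base).

```python
import math
from collections import Counter

def term_frequency(term, tokens):
    """Raw count of `term` in one tokenized document."""
    return Counter(tokens)[term]

def inverse_document_frequency(term, corpus):
    """log(N / df): N documents in total, df of them contain `term`."""
    doc_freq = sum(1 for tokens in corpus if term in tokens)
    if doc_freq == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus) / doc_freq)

# Hypothetical toy corpus: four already-tokenized documents.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran", "home"],
    ["a", "bird", "sang"],
]

tf_cat = term_frequency("cat", corpus[0])            # "cat" occurs once in document 0
idf_cat = inverse_document_frequency("cat", corpus)  # log(4 / 2)
idf_bird = inverse_document_frequency("bird", corpus)  # log(4 / 1), rarer term
```

A term found in fewer documents ("bird") gets a larger idf than one found in more documents ("cat").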

Stemming groups all words that are derived from the same stem (e.g. played, play, ...). This grouping increases the number of occurrences of the stem, because frequencies are calculated on stems rather than on the individual word forms.

For example, say you have 2 documents: the first one contains 'play' 2 times and 'played' 5 times, and the second one contains 'play' 3 times and 'played' 1 time. If you search for 'play' without stemming, the second document will be ranked first, because it has more occurrences of the word 'play' (3 versus 2). If you do stemming, both words become 'play', and the first document will be ranked first, because it contains the stem 'play' 7 times while the second document contains it only 4 times.
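
The two-document example can be checked with a short Python sketch. Note that `toy_stem` is a hypothetical stand-in that only strips a trailing "ed"; it is not a real stemming algorithm such as Porter's, and is used here just to keep the example self-contained.

```python
from collections import Counter

def toy_stem(word):
    """Hypothetical stemmer for illustration only: strips a trailing 'ed'."""
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]
    return word

def tf(term, tokens, stem=False):
    """Count `term` in `tokens`, optionally stemming both first."""
    if stem:
        tokens = [toy_stem(t) for t in tokens]
        term = toy_stem(term)
    return Counter(tokens)[term]

# The two documents from the example, reduced to the relevant words.
doc1 = ["play"] * 2 + ["played"] * 5
doc2 = ["play"] * 3 + ["played"] * 1

# Without stemming, only the exact form 'play' is counted.
unstemmed = (tf("play", doc1), tf("play", doc2))

# With stemming, 'played' is folded into 'play' before counting.
stemmed = (tf("play", doc1, stem=True), tf("play", doc2, stem=True))

print(unstemmed)  # (2, 3): document 2 ranks first
print(stemmed)    # (7, 4): document 1 ranks first
```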

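Concerning stop-word removal: stop words occur frequently in all documents and are not considered keywords for any of them, so without removal they would have a high frequency without carrying any meaning. Because such a word appears in nearly every document, its idf is close to zero anyway; removing it mainly discards a high-tf term that says nothing about the document.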
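
A small Python sketch of the stop-word point, under the same assumptions as before (raw counts, natural log). The stop-word list and the three documents are invented for illustration.

```python
import math
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not a standard one.
STOP_WORDS = {"the", "a", "of", "and"}

docs = [
    "the cat and the hat".split(),
    "the dog and the bone".split(),
    "the rise of the machines".split(),
]

def idf(term, corpus):
    """log(N / df) with no smoothing."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df)

# 'the' has a high tf in every document ...
the_tf = [Counter(d)["the"] for d in docs]

# ... but it occurs in all documents, so its idf is log(3 / 3).
the_idf = idf("the", docs)

# Stop-word removal simply drops such terms before any counting.
filtered = [[t for t in d if t not in STOP_WORDS] for d in docs]
```

After filtering, only the content words remain, so the remaining term frequencies reflect what each document is actually about.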
