Does stemming harm precision in text classification?

Question

I have read that stemming harms precision but improves recall in text classification. How does that happen? When you stem, you increase the number of matches between the query and the sample documents, right?

Recommended answer

It's always the same: if you raise recall, you're generalising, and because of that you lose precision. Stemming merges words together, so a query matches more documents, including some that it should not match.

On the one hand, words which ought to be merged together (such as "adhere" and "adhesion") may remain distinct after stemming; on the other, words which are really distinct may be wrongly conflated (e.g., "experiment" and "experience"). These are known as understemming errors and overstemming errors respectively.
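To make the two error types concrete, here is a minimal sketch. `toy_stem` and its suffix list are hypothetical, written only for this illustration; it is a crude suffix stripper, not the Porter stemmer or any library's algorithm, and real stemmers may treat these words differently.

```python
# Hypothetical suffix list, chosen only to reproduce the two error types.
SUFFIXES = ("ment", "ence", "ion", "e")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Overstemming: two distinct words collapse to the same stem.
overstemmed = toy_stem("experiment") == toy_stem("experience")

# Understemming: two related words keep different stems.
understemmed = toy_stem("adhere") != toy_stem("adhesion")

print(toy_stem("experiment"), toy_stem("experience"), overstemmed)
print(toy_stem("adhere"), toy_stem("adhesion"), understemmed)
```

With this toy rule set, "experiment" and "experience" share a stem (an overstemming error), while "adhere" and "adhesion" do not (an understemming error).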

Overstemming lowers precision and understemming lowers recall. So, since no stemming at all means no overstemming errors but the maximum number of understemming errors, you get low recall and high precision there.

By the way, precision is the fraction of the 'documents' you found that are the ones you were looking for. Recall is the fraction of all correct 'documents' that you actually received.
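The trade-off can be seen on a tiny made-up example. Everything in this sketch is an assumption for illustration: the four documents, the relevance labels, the query, and the crude `toy_stem` suffix stripper (which is not a real stemming algorithm).

```python
# Made-up corpus: d1 and d2 are about experiments, d3 and d4 are not.
docs = {
    "d1": "the experiment confirmed the hypothesis",
    "d2": "we repeated the experiments twice",
    "d3": "years of experience in sales",
    "d4": "the weather was nice",
}
relevant = {"d1", "d2"}
query = "experiment"

def toy_stem(word):
    """Hypothetical suffix stripper that keeps at least 3 characters."""
    word = word.lower()
    for suffix in ("ments", "ment", "ence", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def retrieve(query, docs, normalize):
    """Return ids of documents containing the normalized query term."""
    q = normalize(query)
    return {
        doc_id
        for doc_id, text in docs.items()
        if q in {normalize(token) for token in text.split()}
    }

def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

no_stem = precision_recall(retrieve(query, docs, str.lower), relevant)
with_stem = precision_recall(retrieve(query, docs, toy_stem), relevant)
print("no stemming:  ", no_stem)
print("with stemming:", with_stem)
```

Without stemming, only d1 matches, so precision is 1.0 and recall is 0.5. With the toy stemmer, d2 is now found (recall rises to 1.0), but d3 is pulled in as well because "experience" is conflated with "experiment", so precision drops to 2/3.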
