Python stemmer issue: wrong stem

Question

Hi, I'm trying to stem words with a Python stemmer. I tried Porter and Lancaster, but they have the same problem: they can't correctly stem words that end with "er" or "e".

For example, they stem:

computer -->  comput

rotate   -->  rotat

This is part of the code:

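# Fragment of a function body; needs `import re` at module level.
# `stops` (the stopword collection) and `porter` / `st` (stemmer instances) are defined elsewhere.
line = line.lower()
line = re.sub(r'[^a-z0-9 ]', ' ', line)  # replace everything except a-z, 0-9 and spaces
line = line.split()
line = [x for x in line if x not in stops]  # drop stopwords
line = [porter.stem(word, 0, len(word) - 1) for word in line]
# or: line = [st.stem(word) for word in line]
return line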

Any idea how to fix this problem?

Recommended answer

To quote the page on Wikipedia: In computational linguistics, a stem is the part of the word that never changes even when morphologically inflected, whilst a lemma is the base form of the word. For example, given the word "produced", its lemma (linguistics) is "produce", however the stem is "produc": this is because there are words such as "production". So your code is likely giving you correct results. You seem to expect a lemma, which is not what a stemmer produces (except when the lemma happens to equal the stem).
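
To make the stem/lemma distinction concrete, here is a minimal sketch in plain Python. It is a toy illustration only, not the Porter or Lancaster algorithm: the suffix list, the lookup table, and the names `toy_stem` and `toy_lemma` are all hypothetical and chosen just for this example.

```python
# Toy illustration only: NOT the Porter or Lancaster algorithm.
# Hypothetical suffix list; checked in this order.
SUFFIXES = ("tion", "ing", "es", "ed", "er", "e", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


# Hypothetical lookup table standing in for a real lemmatizer's dictionary.
LEMMAS = {
    "produced": "produce",
    "producing": "produce",
    "produces": "produce",
    "rotated": "rotate",
    "computers": "computer",
}


def toy_lemma(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)


for w in ["produce", "produced", "production", "computer", "rotate"]:
    print(f"{w:12} stem={toy_stem(w):8} lemma={toy_lemma(w)}")
```

In this sketch "produce", "produced" and "production" all share the stem "produc", which is not itself a word, while the lemma lookup returns the dictionary form "produce". If dictionary forms such as "computer" and "rotate" are what you need, a lemmatizer (for example NLTK's `WordNetLemmatizer`, which needs the WordNet data installed) is probably the tool to look at rather than a stemmer.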
