How to stop NLTK stemmer from removing the trailing "e"?


Question

I'm using an NLTK stemmer to remove grammatical variations of a stem word. However, the Porter and Snowball stemmers remove the trailing "e" from the base form of a noun or verb, e.g., "profile" becomes "profil".

How can I prevent this from happening? I know I can use a conditional to guard against it, but that will obviously fail in some cases.

Is there an option or another API for what I want?
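
The behavior described in the question can be reproduced with a minimal sketch like the one below. It assumes NLTK is installed; the variable names `porter` and `snowball` are illustrative only. Neither stemmer needs a corpus download.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Both algorithms return a stem rather than a dictionary word,
# so the final "e" of "profile" is stripped.
for word in ["profile", "profiles"]:
    print(word, porter.stem(word), snowball.stem(word))
```

For "profile", both stemmers return "profil", which matches the output reported in the question.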

Recommended answer

I agree with Philip that the goal of a stemmer is to retain only the stem. For this particular case you can try a lemmatizer instead of a stemmer. A lemmatizer should retain more of the word, since it is meant to strip only the inflected forms, e.g. 'profiles' --> 'profile'. NLTK has a class for this: try WordNetLemmatizer() from nltk.stem.

Beware that it's still not perfect (nothing is when working with text): I have gotten 'physic' from 'physics'.
