How to stop NLTK stemmer from removing the trailing "e"?
Problem description
I'm using an NLTK stemmer to strip grammatical variations from words. However, the Porter and Snowball stemmers also remove the trailing "e" from the base form of a noun or verb; for example, "profile" becomes "profil".
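A minimal reproduction of the behavior described above, assuming NLTK is installed (the variable names are only illustrative):

```python
# Reproduces the behavior described above; assumes NLTK is installed.
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

word = "profile"
porter_stem = porter.stem(word)
snowball_stem = snowball.stem(word)

# Both stemmers drop the trailing "e" from the base form.
print(porter_stem, snowball_stem)
```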
How can I prevent this? I know I could guard against it with a conditional, but that would obviously fail in other cases.
Is there an option for this, or another API that does what I want?
Recommended answer
I agree with Philip that the goal of a stemmer is to retain only the stem. For this particular case, you can try a lemmatizer instead of a stemmer. A lemmatizer should retain more of the word, and it is meant to remove exactly these inflected forms, e.g. 'profiles' --> 'profile'. NLTK has a class for this: try WordNetLemmatizer() from nltk.stem.
Beware that it's still not perfect (nothing is when working with text): I have gotten 'physic' from 'physics'.