How do I do word stemming or lemmatization?

Problem description

I've tried PorterStemmer and Snowball, but neither works on all words; both miss some very common ones.

My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right.
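To reproduce this comparison, here is a minimal sketch. It assumes the NLTK implementations of both stemmers, which the question does not actually specify. `compare_stemmers`, `nltk_stemmers`, and `TEST_WORDS` are hypothetical names used only for illustration.

```python
def compare_stemmers(words, stemmers):
    """Map each word to the stem produced by every stemmer in `stemmers`.

    `stemmers` is a dict of {label: callable}, where each callable takes a
    word and returns its stem.
    """
    return {word: {label: stem(word) for label, stem in stemmers.items()}
            for word in words}


def nltk_stemmers():
    """Build the two stemmers from the question, assuming NLTK's versions."""
    # Imported lazily so compare_stemmers itself works without NLTK installed.
    from nltk.stem import PorterStemmer, SnowballStemmer
    return {
        "porter": PorterStemmer().stem,
        "snowball": SnowballStemmer("english").stem,
    }


# The test words listed in the question.
TEST_WORDS = "cats running ran cactus cactuses cacti community communities".split()
```

Calling `compare_stemmers(TEST_WORDS, nltk_stemmers())` and printing the result puts the stems side by side. Since stemmers strip suffixes by rule rather than looking words up, irregular forms such as "ran" or "cacti" are the ones I would expect them to miss.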

Solution

If you know Python, the Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet.

Note that if you are using this lemmatizer for the first time, you must download the corpus prior to using it. This can be done by:

>>> import nltk
>>> nltk.download('wordnet')

You only have to do this once. Assuming that you have now downloaded the corpus, it works like this:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'
>>> lmtzr.lemmatize('feet')
'foot'
>>> lmtzr.lemmatize('people')
'people'
>>> lmtzr.lemmatize('fantasized','v')
'fantasize'

There are other lemmatizers in the nltk.stem module, but I haven't tried them myself.
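
The second argument in the last call above is the part of speech. `lemmatize` treats a word as a noun unless told otherwise, which is why 'fantasized' needs 'v'. When the part of speech isn't known ahead of time, one rough workaround is to try each tag in turn and keep the first result that changes the word. The sketch below does that. `lemmatize_any_pos` and `wordnet_lemmatize` are hypothetical helper names, not part of NLTK, and the first-change-wins rule is a heuristic that can pick the wrong reading for ambiguous words.

```python
def lemmatize_any_pos(word, lemmatize, pos_tags=("n", "v", "a", "r")):
    """Return the first lemma that differs from `word`, trying each POS tag.

    `lemmatize` is any callable taking (word, pos), such as
    WordNetLemmatizer().lemmatize. Falls back to `word` unchanged.
    """
    for pos in pos_tags:
        lemma = lemmatize(word, pos)
        if lemma != word:
            return lemma
    return word


def wordnet_lemmatize():
    """Return NLTK's WordNet lemmatize method (needs the 'wordnet' corpus)."""
    # Imported lazily so lemmatize_any_pos can be used with any lemmatizer.
    from nltk.stem.wordnet import WordNetLemmatizer
    return WordNetLemmatizer().lemmatize
```

With the corpus downloaded, `lemmatize_any_pos('ran', wordnet_lemmatize())` tries the noun reading first and then the verb reading. If you have full sentences rather than isolated words, a POS tagger is likely a more reliable way to choose the tag.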
