How do I do word Stemming or Lemmatization?


Problem description

I've tried PorterStemmer and Snowball but both don't work on all words, missing some very common ones.

My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right.
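
For context, here is a minimal sketch (assuming NLTK is installed) that runs both stemmers over those test words. The exact stems depend on your NLTK version, but they tend to be truncated, non-dictionary forms, and irregular plurals such as "cacti" are typically left unchanged.

# Minimal sketch, assuming NLTK is installed, reproducing the issue:
# both stemmers strip suffixes rather than look words up, so the results
# are often not real words and irregular forms are left as-is.
from nltk.stem import PorterStemmer, SnowballStemmer

words = "cats running ran cactus cactuses cacti community communities".split()
porter = PorterStemmer()
snowball = SnowballStemmer("english")

for w in words:
    print(f"{w:12} porter={porter.stem(w):12} snowball={snowball.stem(w)}")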

See also:

  • Stemming algorithm that produces real words
  • Stemming - code examples or open source projects?

Recommended answer

If you know Python, the Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet.

Note that if you are using this lemmatizer for the first time, you must download the corpus prior to using it. This can be done by:

>>> import nltk
>>> nltk.download('wordnet')

You only have to do this once. Assuming that you have now downloaded the corpus, it works like this:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'
>>> lmtzr.lemmatize('feet')
'foot'
>>> lmtzr.lemmatize('people')
'people'
>>> lmtzr.lemmatize('fantasized','v')
'fantasize'
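
Applied to the test words from the question, a quick sketch like the one below shows why the POS argument matters: lemmatize() treats every word as a noun by default, so verb forms such as "running" and "ran" only come back as "run" when pos='v' is passed. Exact results may vary slightly with the WordNet data bundled with your NLTK version.

# Sketch: lemmatize the question's test words both as nouns (the default)
# and as verbs. The verb reading is what maps 'running'/'ran' to 'run'.
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
words = "cats running ran cactus cactuses cacti community communities".split()

for w in words:
    print(f"{w:12} noun={lmtzr.lemmatize(w):12} verb={lmtzr.lemmatize(w, pos='v')}")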

There are other lemmatizers in the nltk.stem module, but I haven't tried them myself.
