Word Base/Stem Dictionary


Question

It seems my Google-fu is failing me.

Does anyone know of a freely available word base dictionary that just contains bases of words? So, for something like strawberries, it would have strawberry. But does NOT contain abbreviations or misspellings or alternate spellings (like UK versus US)? Anything quickly usable in Java would be good but just a text file of mappings or anything that could be read in would be helpful.
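The "text file of mappings" idea is easy to sketch. Below is a minimal, hedged example assuming a hypothetical dictionary format of one tab-separated `inflected<TAB>lemma` pair per line; the class name, format, and fallback behavior are illustrative assumptions, not a reference to any particular published word list.

```java
import java.util.HashMap;
import java.util.Map;

public class LemmaLookup {

    // Parse hypothetical "inflected<TAB>lemma" lines into a lookup map.
    static Map<String, String> parse(String fileContents) {
        Map<String, String> lemmas = new HashMap<>();
        for (String line : fileContents.split("\n")) {
            String[] parts = line.trim().split("\t");
            if (parts.length == 2) {
                lemmas.put(parts[0], parts[1]);
            }
        }
        return lemmas;
    }

    // Fall back to the surface form when the word is not in the dictionary,
    // so regular words ("table") pass through unchanged.
    static String lemma(Map<String, String> lemmas, String word) {
        return lemmas.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        String dict = "strawberries\tstrawberry\ngeese\tgoose";
        Map<String, String> lemmas = parse(dict);
        System.out.println(lemma(lemmas, "strawberries")); // strawberry
        System.out.println(lemma(lemmas, "table"));        // table
    }
}
```

In real use you would read the file with `Files.readAllLines` instead of an inline string; the lookup itself stays a single `Map` access.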

Answer

This is called lemmatization, and what you call the "base of a word" is called a lemma. morpha and its reimplementation in the Stanford POS tagger do this. Both, however, require POS tagged input to resolve the inherent ambiguity in natural language.

(POS tagging means determining the word categories, e.g. noun, verb. I've been assuming you want a tool that handles English.)
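To see why POS-tagged input is needed, consider an ambiguous form like "saw": as a verb its lemma is "see", as a noun it is its own lemma. The sketch below illustrates the principle with a tiny hand-built lexicon keyed on word plus Penn-Treebank-style tag; a real tool like morpha derives the mapping from morphological rules rather than a table, so the class name and entries here are illustrative assumptions only.

```java
import java.util.HashMap;
import java.util.Map;

public class PosLemmatizer {

    // Hypothetical mini-lexicon keyed on "word/POS". A real lemmatizer
    // computes these mappings; this table only demonstrates the ambiguity.
    static final Map<String, String> LEXICON = new HashMap<>();
    static {
        LEXICON.put("saw/VB", "see");    // verb: past tense of "see"
        LEXICON.put("saw/NN", "saw");    // noun: the cutting tool
        LEXICON.put("left/VB", "leave"); // verb: past tense of "leave"
        LEXICON.put("left/JJ", "left");  // adjective: the direction
    }

    // Same surface form, different lemma depending on the POS tag.
    static String lemma(String word, String posTag) {
        return LEXICON.getOrDefault(word + "/" + posTag, word);
    }

    public static void main(String[] args) {
        System.out.println(lemma("saw", "VB")); // see
        System.out.println(lemma("saw", "NN")); // saw
    }
}
```

Without the tag, no dictionary lookup can choose between "see" and "saw"; that is the ambiguity the answer refers to.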

Edit: since you're going to use this for search, here are a few tips:

  • Simple stemming for English has a mixed reputation in the search engine world. Sometimes it works, often it doesn't.
  • Automatic spelling correction may work better. This is what Google does. It's expensive in terms of computing time, though, if you want to do it right.
  • Lemmatization may provide benefits, but probably only if you index and search for both the words and the lemmas. (Same advice goes for stemming.)
  • Here's a plugin for Lucene that does lemmatization.
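The advice to index and search both the words and the lemmas can be sketched as a toy inverted index. This is not the Lucene plugin mentioned above; it is a self-contained illustration in which the class name, the tiny in-code lemma table, and the matching rule are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DualFormIndex {

    // docId -> set of searchable terms (surface forms plus their lemmas).
    private final Map<Integer, Set<String>> index = new HashMap<>();

    // Toy lemma table standing in for a real lemmatizer.
    private static final Map<String, String> LEMMAS = Map.of(
            "strawberries", "strawberry",
            "picked", "pick");

    void addDocument(int docId, String text) {
        Set<String> terms = new HashSet<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            terms.add(token);                                // surface form
            terms.add(LEMMAS.getOrDefault(token, token));    // its lemma
        }
        index.put(docId, terms);
    }

    // A query term hits if either its surface form or its lemma was indexed,
    // so "strawberry" finds a document containing only "strawberries".
    boolean matches(int docId, String queryTerm) {
        Set<String> terms = index.getOrDefault(docId, Set.of());
        String q = queryTerm.toLowerCase();
        return terms.contains(q) || terms.contains(LEMMAS.getOrDefault(q, q));
    }

    public static void main(String[] args) {
        DualFormIndex idx = new DualFormIndex();
        idx.addDocument(1, "We picked strawberries");
        System.out.println(idx.matches(1, "strawberry"));   // true
        System.out.println(idx.matches(1, "strawberries")); // true
    }
}
```

Indexing both forms keeps exact matches exact while still letting inflected queries and documents find each other, which is the trade-off the tip describes.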

(Preceding remarks are based on my own research; I wrote my master's thesis about lemmatization in search engines for very noisy data.)
