Looking for a database or text file of English words with their different forms
Problem description
I am working on a project in which I need to get the root of a given word (stemming). As you know, stemming algorithms that don't use a dictionary are not accurate. I also tried WordNet, but it is not a good fit for my project. I found the phpmorphy project, but it doesn't include a Java API.
So I am now looking for a database or a text file of English words with their different forms. For example:
run running ran ... include including included ... ...
Thank you for your help or advice.
Recommended answer
You could download LanguageTool (disclaimer: I'm the maintainer), which comes with a binary file english.dict. The LanguageTool Wiki describes how to dump that file as a text file:
java -jar morfologik-tools-1.6.0-standalone.jar fsa_dump -x -d english.dict
For run, the file will contain this:
ran run VBD
run run NN
run run VB
run run VBN
run run VBP
running run VBG
runs run NNS
runs run VBZ
Based on the (slightly extended) ...
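Once the dictionary is dumped to text, looking up the root of a word from Java comes down to reading those lines into a map. The following is only a minimal sketch, not part of LanguageTool: the class name WordFormIndex and its fields are made up for illustration, and it assumes each line holds the inflected form, the base form, and the tag in that order, separated by whitespace, as in the sample output above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: indexes lines shaped like "running run VBG". */
final class WordFormIndex {
    // inflected form -> possible base forms (a form can belong to several)
    final Map<String, Set<String>> lemmasByForm = new LinkedHashMap<>();
    // base form -> every inflected form seen for it
    final Map<String, Set<String>> formsByLemma = new LinkedHashMap<>();

    /** Assumption: fields are whitespace-separated, in the order form, lemma, tag. */
    static WordFormIndex fromLines(Iterable<String> lines) {
        WordFormIndex index = new WordFormIndex();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 2) {
                continue; // skip blank or malformed lines
            }
            String form = fields[0];
            String lemma = fields[1];
            index.lemmasByForm.computeIfAbsent(form, k -> new TreeSet<>()).add(lemma);
            index.formsByLemma.computeIfAbsent(lemma, k -> new TreeSet<>()).add(form);
        }
        return index;
    }

    public static void main(String[] args) {
        // A few of the sample lines from the dump shown above
        List<String> sample = List.of(
                "ran run VBD", "run run NN", "running run VBG", "runs run VBZ");
        WordFormIndex index = fromLines(sample);
        System.out.println(index.lemmasByForm.get("running")); // base forms of "running"
        System.out.println(index.formsByLemma.get("run"));     // all forms of "run"
    }
}
```

For the real data you would pass the lines of the dumped file instead of the sample list, for example via Files.readAllLines on whatever file you redirected the dump into. The tag in the third column is ignored here; keep it if you need to tell the noun "runs" from the verb "runs".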