Which NLP toolkit to use in Java?

Question

I'm working on a project that consists of a website that connects to the NCBI (National Center for Biotechnology Information) and searches for articles there. I have to do some text mining on all the results. I'm using Java for the text mining and AJAX with ICEfaces for the website.

What I have: a list of articles returned from a search. Each article has an ID and an abstract. The idea is to extract keywords from each abstract, then compare the keywords from all abstracts and find the ones that are repeated most often, so that the website can show the words related to the search.

Any ideas? I have searched the web a lot, and I know there is Named Entity Recognition and part-of-speech tagging, and that there is the GENIA thesaurus for NER on genes and proteins. I have already tried stemming, stop-word lists, etc. I just need to know the best approach to solve this problem. Thanks a lot.

Recommended answer

I would recommend using a combination of POS tagging and string tokenizing to extract all the nouns from each abstract. Then use some sort of dictionary/hash to count the frequency of each noun, and output the N most frequent nouns. Combined with some other intelligent filtering mechanisms, that should do reasonably well at giving you the important keywords from the abstract.

For POS tagging, check out the POS tagger at http://nlp.stanford.edu/software/index.shtml
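The counting step can be sketched in plain Java. This is only a minimal illustration, not the tagger's own API: it assumes each abstract has already been run through a POS tagger and is available as a string of `word_TAG` tokens with Penn Treebank style tags (noun tags starting with `NN`). The class name `NounFrequency` and its methods are hypothetical; adapt the parsing to whatever your tagger actually emits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: counts nouns in already-tagged text.
final class NounFrequency {

    /** Counts every token whose tag starts with "NN" across all tagged abstracts. */
    static Map<String, Integer> countNouns(List<String> taggedAbstracts) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tagged : taggedAbstracts) {
            for (String token : tagged.trim().split("\\s+")) {
                // Assumed token format: word_TAG (split on the last underscore).
                int sep = token.lastIndexOf('_');
                if (sep <= 0 || sep == token.length() - 1) {
                    continue; // skip anything that is not word_TAG
                }
                String word = token.substring(0, sep).toLowerCase(Locale.ROOT);
                String tag = token.substring(sep + 1);
                if (tag.startsWith("NN")) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Returns the n most frequent words; ties are broken alphabetically. */
    static List<String> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

Calling `NounFrequency.topN(NounFrequency.countNouns(taggedAbstracts), 10)` would give the ten most frequent nouns over all abstracts. The sketch does no stemming, so singular and plural forms are counted separately; a stop-word list or stemmer could be applied to `word` before counting.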

However, if you expect a lot of multi-word terms in your corpus, then instead of extracting just nouns you could take the most frequent n-grams for n = 2 to 4.
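A rough sketch of the n-gram variant follows. It assumes a deliberately naive tokenizer that lowercases the text and splits on anything that is not a letter or digit, so hyphenated terms are split and n-grams may span sentence boundaries; the class name `NGramCounter` is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: counts word n-grams of several lengths in one pass per length.
final class NGramCounter {

    /** Counts all word n-grams with minN <= n <= maxN in the given text. */
    static Map<String, Integer> countNGrams(String text, int minN, int maxN) {
        // Naive tokenization (assumption): split on non-alphanumeric characters.
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        Map<String, Integer> counts = new HashMap<>();
        for (int n = minN; n <= maxN; n++) {
            // Slide a window of n words over the token list.
            for (int i = 0; i + n <= words.size(); i++) {
                String gram = String.join(" ", words.subList(i, i + n));
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

`NGramCounter.countNGrams(abstractText, 2, 4)` returns a map from each 2- to 4-word sequence to its count, which can then be ranked by frequency. In practice you would probably also drop n-grams that begin or end with a stop word.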
