How to extract keywords (tags) from text
Question
I am currently trying to implement a tagging engine in Java and have been looking for ways to extract keywords/tags from texts (articles). I found some answers on Stack Overflow suggesting Pointwise Mutual Information (PMI).
I can't use Python and NLTK, so I have to implement it myself, but I don't know how to calculate the probabilities. The equation looks like this:
PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]
What I want to know is how to calculate P(term, doc).
I already have a large text corpus and a collection of articles. The articles are not part of the corpus. The corpus is indexed with Lucene.
Please help me out. Best regards.
Recommended answer
There are many algorithms and tools for doing this:
Open-source tools:
KEA (http://www.nzdl.org/Kea/): a supervised approach that uses training data and a controlled vocabulary.
Maui indexer (http://code.google.com/p/maui-indexer/): essentially an extension of KEA that adds the ability to use an encyclopedia for key phrase extraction.
Carrot2 (http://project.carrot2.org/): an unsupervised approach to key phrase extraction. It supports many input formats, output formats and parameters.
MALLET topic modeling module (http://mallet.cs.umass.edu/topics.php)
Stanford topic modeling tool (http://nlp.stanford.edu/software/tmt/tmt-0.3/)
Mahout clustering algorithms (http://mahout.apache.org/)
Commercial APIs:
Alchemy API (http://www.alchemyapi.com/api/keyword-extraction/)
Zemanta API (http://www.zemanta.com/developer/)
Yahoo term extraction API (http://developer.yahoo.com/contentanalysis/)
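
The tools above sidestep the original question of how to estimate P(term, doc). As a rough sketch only, one common choice is maximum-likelihood estimates from token counts, treating the article as if it had been added to the collection: P(term, doc) = occurrences of the term in the article / N, P(term) = occurrences of the term in the whole collection / N, and P(doc) = article length / N, where N is the total number of tokens. The class name PmiSketch, its method names and this particular estimator are assumptions made for illustration; they are not taken from the question or from any library.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; the names and the estimator are illustrative assumptions.
final class PmiSketch {

    /** Counts how often each term occurs in an already tokenized article. */
    static Map<String, Integer> termCounts(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /**
     * PMI(term, doc) = log [ P(term, doc) / (P(term) * P(doc)) ], estimated from counts:
     *   P(term, doc) = countInDoc / totalTokens
     *   P(term)      = countInCollection / totalTokens
     *   P(doc)       = docLength / totalTokens
     * Assumption: countInCollection and totalTokens include the article itself.
     */
    static double pmi(long countInDoc, long docLength,
                      long countInCollection, long totalTokens) {
        if (countInDoc <= 0) {
            // The term does not occur in the article, so log(0) is negative infinity.
            return Double.NEGATIVE_INFINITY;
        }
        double pTermDoc = (double) countInDoc / totalTokens;
        double pTerm = (double) countInCollection / totalTokens;
        double pDoc = (double) docLength / totalTokens;
        return Math.log(pTermDoc / (pTerm * pDoc));
    }
}
```

For example, with countInDoc = 2, docLength = 10, countInCollection = 4 and totalTokens = 100, the ratio is 0.02 / (0.04 * 0.1) = 5, so the PMI is log 5, roughly 1.61. The collection-level counts would have to come from the Lucene index; which calls expose them depends on the Lucene version. Because the articles are not part of the corpus, the article's own tokens should be added to countInCollection and totalTokens, otherwise a term that never occurs in the corpus yields an infinite score. Ranking an article's terms by this score and keeping the top few gives candidate tags.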