How to extract keywords (tags) from text

Question

I am currently trying to implement a tagging engine in Java and have been looking for ways to extract keywords/tags from texts (articles). I found some answers on Stack Overflow that suggest using Pointwise Mutual Information (PMI).

I can't use Python and NLTK, so I have to implement it myself, but I don't know how to calculate the probabilities. The equation looks like this:

PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]

What I want to know is how to calculate P(term, doc).

I already have a large text corpus and a collection of articles. The articles are not part of the corpus. The corpus is indexed with Lucene.

Please help me out. Best regards.

Recommended answer

There are a lot of algorithms and tools for doing this:

Open-source tools:

KEA (http://www.nzdl.org/Kea/): a supervised approach that uses training data and a controlled vocabulary.

Maui indexer (http://code.google.com/p/maui-indexer/): basically an extension of KEA that adds the ability to use an encyclopedia for key phrase extraction.

Carrot2 (http://project.carrot2.org/): an unsupervised approach to key phrase extraction. It supports many variations of input formats, output formats, and parameters.

MALLET topic modeling module (http://mallet.cs.umass.edu/topics.php)

Stanford topic modeling tool (http://nlp.stanford.edu/software/tmt/tmt-0.3/)

Mahout clustering algorithms (http://mahout.apache.org/)

Commercial APIs:

Alchemy API (http://www.alchemyapi.com/api/keyword-extraction/)

Zemanta API (http://www.zemanta.com/developer/)

Yahoo term extraction API (http://developer.yahoo.com/contentanalysis/)
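
If you do want to implement the PMI formula from the question yourself, here is a minimal sketch of one common way to estimate the probabilities from raw counts. It is not taken from any of the tools above. It assumes maximum-likelihood estimates over tokens: P(term, doc) is the term's count in the document divided by the total token count of the collection, P(term) is the term's collection-wide count divided by that total, and P(doc) is the document's length divided by that total. The class name `PmiSketch`, the method names, and the numbers in `main` are all hypothetical. Since the articles are not part of the indexed corpus, the sketch also assumes the collection-wide counts have already been combined with the article's own counts, so the term's collection count is never smaller than its count in the document.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PmiSketch {

    /** Counts how often each token occurs in an already tokenized document. */
    static Map<String, Integer> termCounts(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /**
     * PMI(term, doc) = log [ P(term, doc) / (P(term) * P(doc)) ], estimated as:
     *   P(term, doc) = tfInDoc        / totalTokens
     *   P(term)      = tfInCollection / totalTokens
     *   P(doc)       = docLength      / totalTokens
     * All counts are assumed to refer to the same collection (corpus plus article).
     */
    static double pmi(long tfInDoc, long docLength, long tfInCollection, long totalTokens) {
        if (tfInDoc == 0 || docLength == 0 || tfInCollection == 0) {
            // The term never co-occurs with the document: log(0) is minus infinity.
            return Double.NEGATIVE_INFINITY;
        }
        double pTermDoc = (double) tfInDoc / totalTokens;
        double pTerm = (double) tfInCollection / totalTokens;
        double pDoc = (double) docLength / totalTokens;
        return Math.log(pTermDoc / (pTerm * pDoc));
    }

    public static void main(String[] args) {
        // Hypothetical tokenized article.
        List<String> doc = Arrays.asList("lucene", "indexes", "text", "lucene", "scores", "text");
        Map<String, Integer> counts = termCounts(doc);

        // Hypothetical collection statistics (corpus plus this article).
        long totalTokens = 1000;
        long tfLuceneInCollection = 10;

        double score = pmi(counts.get("lucene"), doc.size(), tfLuceneInCollection, totalTokens);
        System.out.println("PMI(lucene, doc) = " + score);
    }
}
```

Under these estimates the total token count cancels down to a single factor, so the score reduces to log(tfInDoc * totalTokens / (tfInCollection * docLength)). To pick tags, you would score every distinct term of an article this way and keep the highest-scoring ones. Raw PMI tends to favor very rare terms, so a minimum-frequency cutoff is a common addition.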
