Entity extraction/recognition with free tools while feeding a Lucene index
Question
I'm currently investigating options for extracting person names, locations, tech terms and categories from text (a lot of articles from the web), which will then be fed into a Lucene/ElasticSearch index. The additional information is added as metadata and should increase the precision of the search.
For example, when someone queries 'wicket', they should be able to choose whether they mean the cricket term or the Apache project. I have tried to implement this on my own, with only minor success so far. I have since found a lot of tools, but I'm not sure whether they are suited to this task, which of them integrate well with Lucene, or whether their entity-extraction precision is high enough.
- DBpedia Spotlight (the demo looks very promising)
- OpenNLP requires training. Which training data to use?
- OpenNLP tools
- Stanbol
- NLTK
- balie
- UIMA
- GATE -> example code
- Apache Mahout
- Stanford CRF-NER
- maui-indexer
- Mallet
- Illinois Named Entity Tagger (not open source, but free)
- wikipedianer data
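My questions:
- Does anyone have experience with some of the tools listed above, and with their precision/recall? Or with whether training data is required and available?
- Are there articles or tutorials that would get me started with entity extraction (NER) for each tool?
- How can they be integrated with Lucene?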
Here are some questions related to that subject:
Recommended answer
The problem you are facing in the 'wicket' example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project (both would fall outside the typically recognized types: person, organization, location).
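To make the distinction concrete, here is a minimal sketch of what a disambiguator does that a person/organization/location tagger does not: it picks among candidate senses of the same surface form by looking at the surrounding context. The `SENSES` table, its identifiers and its keyword sets are hand-written assumptions for illustration only; a real system would draw candidates and context features from a large knowledge base and would use a much better scoring function than plain word overlap.

```python
import re

# Hypothetical, hand-written sense inventory standing in for a real knowledge
# base; the identifiers and keyword sets are illustrative assumptions.
SENSES = {
    "wicket": {
        "Wicket_(cricket)": {"cricket", "bowler", "batsman", "stumps", "innings", "match"},
        "Apache_Wicket": {"apache", "java", "framework", "web", "component", "servlet"},
    }
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(surface_form, context):
    """Return the candidate sense whose keywords overlap most with the context.

    Returns None when the surface form is unknown or no keyword matches.
    """
    candidates = SENSES.get(surface_form.lower(), {})
    words = tokenize(context)
    best, best_score = None, 0
    for sense, keywords in candidates.items():
        score = len(keywords & words)  # naive overlap count
        if score > best_score:
            best, best_score = sense, score
    return best


print(disambiguate("wicket", "The bowler took a wicket in the second innings."))
print(disambiguate("wicket", "Wicket is a Java web framework from Apache."))
```

With these toy keyword sets, the first call resolves to the cricket sense and the second to the Apache project; a context with no matching keywords yields `None` rather than a guess.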
For disambiguation, you need a knowledge base against which entities are disambiguated. DBpedia is a typical choice because of its broad coverage. See my answer to "How to use DBPedia to extract Tags/Keywords from content?", where I provide more explanation and mention several tools for disambiguation, including:
- Zemanta
- Maui-indexer
- DBpedia Spotlight
- Extractiv (my company)
These tools often expose a language-independent API such as REST. I don't know whether any of them provide Lucene support directly, but I hope this answer helps with the problem you are trying to solve.
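Because the tools are reached over a language-independent API, one possible integration is to annotate each article before indexing and store the results as extra fields on the document. The sketch below only shows that merging step. The annotation keys (`uri`, `surface_form`, `confidence`), the output field names and the sample URIs are assumptions made for illustration, not the response format of any particular tool, so a real response would have to be mapped into this shape first.

```python
import json


def build_index_document(doc_id, text, annotations, min_confidence=0.5):
    """Merge entity annotations into a flat document ready for indexing.

    `annotations` is a list of dicts with the hypothetical keys
    "uri", "surface_form" and "confidence".
    """
    # Drop low-confidence annotations so they do not hurt search precision.
    kept = [a for a in annotations if a["confidence"] >= min_confidence]
    return {
        "id": doc_id,
        "body": text,
        # Knowledge-base identifiers, intended for exact-match filtering.
        "entity_uris": sorted({a["uri"] for a in kept}),
        # Lowercased mention strings, deduplicated.
        "entity_mentions": sorted({a["surface_form"].lower() for a in kept}),
    }


# Illustrative annotations; values are made up for the example.
annotations = [
    {"uri": "http://dbpedia.org/resource/Apache_Wicket", "surface_form": "Wicket", "confidence": 0.92},
    {"uri": "http://dbpedia.org/resource/Java_(programming_language)", "surface_form": "Java", "confidence": 0.88},
    {"uri": "http://dbpedia.org/resource/Wicket", "surface_form": "Wicket", "confidence": 0.12},
]

doc = build_index_document("article-1", "Wicket is a Java web framework.", annotations)
print(json.dumps(doc, indent=2))
```

The resulting dictionary could be sent to ElasticSearch as a JSON document or mapped onto Lucene fields. Indexing the URI field without tokenization should let a query for 'wicket' be narrowed to one specific entity.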