Entity Extraction/Recognition with free tools while feeding a Lucene index

Question

I'm currently investigating options for extracting person names, locations, technical terms and categories from text (a large number of articles from the web), which will then be fed into a Lucene/ElasticSearch index. The extracted information is added as metadata and should increase the precision of the search.

For example, when someone queries 'wicket', they should be able to decide whether they mean the sport of cricket or the Apache project. I have tried to implement this on my own, with only minor success so far. I have since found a lot of tools, but I'm not sure whether they are suited to this task, which of them integrate well with Lucene, or whether their entity extraction precision is high enough.

  • Dbpedia Spotlight, the demo looks very promising
  • OpenNLP requires training. Which training data to use?
  • OpenNLP tools
  • Stanbol
  • NLTK
  • balie
  • UIMA
  • GATE -> example code
  • Apache Mahout
  • Stanford CRF-NER
  • maui-indexer
  • Mallet
  • Illinois Named Entity Tagger (not open source, but free)
  • wikipedianer data

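My questions:

  • Does anyone have experience with some of the tools listed above and their precision/recall? And, where training data is required, is it available?
  • Are there articles or tutorials for getting started with entity extraction (NER) for each of these tools?
  • How can they be integrated with Lucene?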

Here are some questions related to this subject:

  • Does an algorithm exist to help detect the "primary topic" of an English sentence?
  • Named Entity Recognition Libraries for Java
  • Named entity recognition with Java

Recommended answer

The problem you are facing in the 'wicket' example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project (both types fall outside the typically recognized ones: person, organization, location).
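To make the distinction concrete, here is a toy sketch of disambiguation by context overlap. It is only an illustration of the idea, not how any of the tools mentioned in this thread work internally. The two sense identifiers and their descriptions are made up for the example; a real system would draw them from a knowledge base.

```python
import re

# Hypothetical mini knowledge base: term -> {sense id: short description}.
# The identifiers and descriptions are invented for this illustration.
KNOWLEDGE_BASE = {
    "wicket": {
        "Wicket_(cricket)": "set of three stumps and two bails in the sport "
                            "of cricket bowler batsman",
        "Apache_Wicket": "component based java web application framework "
                         "apache software project",
    }
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(term, context):
    """Return the sense whose description shares the most words with the context."""
    context_words = tokenize(context)
    senses = KNOWLEDGE_BASE[term.lower()]
    return max(senses, key=lambda s: len(tokenize(senses[s]) & context_words))


if __name__ == "__main__":
    print(disambiguate("wicket", "The bowler took a wicket in the final over of the cricket match"))
    print(disambiguate("wicket", "We built the web application with Wicket, a Java framework"))
```

A plain NER tagger would at best label 'wicket' with a coarse type in both sentences; the step above is what decides which concrete entity is meant. Note that this naive overlap also counts common words such as "the", which a serious implementation would filter or weight down.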

For disambiguation, you need a knowledge base against which entities are disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to How to use DBPedia to extract Tags/Keywords from content?, where I provide more explanation and mention several tools for disambiguation, including:

  • Zemanta
  • Maui-indexer
  • Dbpedia Spotlight
  • Extractiv (my company)

These tools often expose a language-independent API such as REST. I am not aware that any of them provides direct Lucene support, but I hope this answer helps with the problem you are trying to solve.
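Since there is no direct Lucene support, one way to wire things together is to call the REST service yourself and store the returned entity URIs as extra fields on the document you index. The following is a minimal sketch under several assumptions: the endpoint URL, the `confidence` parameter, and the response field names (`Resources`, `@URI`, `@surfaceForm`) reflect my understanding of DBpedia Spotlight's JSON output and should be checked against the service you actually use. The function names, the output field names, and the sample response are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of a public Spotlight instance; a self-hosted one will differ.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def annotate(text, confidence=0.5):
    """Call the annotation service and return its parsed JSON response."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    request = urllib.request.Request(
        SPOTLIGHT_URL + "?" + query, headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def build_index_document(text, annotation):
    """Combine the raw text with entity metadata into one index-ready dict."""
    resources = annotation.get("Resources", [])
    return {
        "body": text,
        # De-duplicated entity URIs, usable as an exact-match filter field.
        "entity_uris": sorted({r["@URI"] for r in resources}),
        # Surface forms as they appeared in the text, kept for display.
        "entity_mentions": sorted({r["@surfaceForm"] for r in resources}),
    }


if __name__ == "__main__":
    # Hand-written sample response (hypothetical), so no network call is needed.
    sample = {
        "Resources": [
            {"@URI": "http://dbpedia.org/resource/Apache_Wicket", "@surfaceForm": "Wicket"},
            {"@URI": "http://dbpedia.org/resource/Java_(programming_language)", "@surfaceForm": "Java"},
        ]
    }
    doc = build_index_document("Wicket is a Java web framework.", sample)
    print(json.dumps(doc, indent=2))
```

The resulting dictionary can be sent to ElasticSearch as a JSON document, or its keys can be mapped to fields of a Lucene document. Indexing `entity_uris` as a non-analyzed field would let a search for 'wicket' be narrowed to one specific entity.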
