Entity Extraction/Recognition with free tools while feeding a Lucene index

Question

I'm currently investigating options for extracting person names, locations, technical terms and categories from text (a large number of articles from the web), which will then be fed into a Lucene/ElasticSearch index. The extracted information is added as metadata and should increase the precision of the search.

For example, when someone queries 'wicket', they should be able to decide whether they mean the sport of cricket or the Apache project. I have tried to implement this on my own, with only minor success so far. I have since found a lot of tools, but I'm not sure whether they are suited to this task, which of them integrate well with Lucene, or whether their entity extraction precision is high enough.

  • Dbpedia Spotlight, the demo looks very promising
  • OpenNLP requires training. Which training data to use?
  • OpenNLP tools
  • Stanbol
  • NLTK
  • balie
  • UIMA
  • GATE -> example code
  • Apache Mahout
  • Stanford CRF-NER
  • maui-indexer
  • Mallet
  • Illinois Named Entity Tagger (not open source, but free)
  • wikipedianer data

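My questions:

  • Does anyone have experience with some of the tools listed above and their precision/recall? Or with whether training data is required, and whether it is available?
  • Are there articles or tutorials that would help me get started with entity extraction (NER) for each of these tools?
  • How can they be integrated with Lucene?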

Here are some questions related to that subject:

Recommended answer

The problem you are facing in the 'wicket' example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project (both types fall outside the typically recognized ones: person, organization, location).
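To make the distinction concrete, here is a minimal Python sketch of disambiguation by context overlap. It is a toy illustration, not how any of the listed tools work internally: the two-entry `KNOWLEDGE_BASE`, its hand-written context words, and the `disambiguate` helper are all hypothetical, and the DBpedia-style URIs are used only as labels.

```python
import re

# Hypothetical two-entry knowledge base; a real system would draw on
# something like DBpedia instead of hand-written context words.
KNOWLEDGE_BASE = {
    "wicket": [
        {
            "uri": "http://dbpedia.org/resource/Wicket",
            "context": "cricket sport bat ball bowler batsman stumps bails pitch",
        },
        {
            "uri": "http://dbpedia.org/resource/Apache_Wicket",
            "context": "apache java web framework component software project servlet",
        },
    ]
}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(surface_form, surrounding_text, kb=KNOWLEDGE_BASE):
    """Return the candidate URI whose context shares the most words with the text.

    Returns None when the surface form is unknown or no candidate overlaps.
    """
    words = tokenize(surrounding_text)
    best_uri, best_score = None, 0
    for candidate in kb.get(surface_form.lower(), []):
        score = len(words & tokenize(candidate["context"]))
        if score > best_score:
            best_uri, best_score = candidate["uri"], score
    return best_uri
```

A plain NER tagger would at best label 'wicket' as some generic entity. The step sketched here is what picks one concrete knowledge-base entry for it, based on the words around the mention.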

For disambiguation, you need a knowledge base against which entities are disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to "How to use DBPedia to extract Tags/Keywords from content?", where I provide more explanation and mention several tools for disambiguation including:

These tools often use a language-independent API such as REST. I do not know whether they provide Lucene support directly, but I hope my answer is helpful for the problem you are trying to solve.
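Because the tools are reached over REST, one possible integration is to annotate each article before indexing and store the linked entity URIs as an extra field. The Python sketch below assumes DBpedia Spotlight's public `annotate` endpoint and a JSON response with a `Resources` list whose items carry an `@URI` key. Check both against the service you actually use. The field names `body` and `entity_uris` and all three helper functions are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed public DBpedia Spotlight endpoint; a self-hosted install will differ.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def annotate(text, confidence=0.5, url=SPOTLIGHT_URL):
    """Send text to a Spotlight-style REST endpoint and return the parsed JSON."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    request = urllib.request.Request(
        url + "?" + query, headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def to_index_document(doc_id, text, annotation):
    """Build an index document that carries the linked entity URIs as metadata."""
    resources = annotation.get("Resources", [])  # assumed response key
    uris = sorted({resource["@URI"] for resource in resources})
    return {"id": doc_id, "body": text, "entity_uris": uris}


def entity_filtered_query(terms, entity_uri):
    """Elasticsearch-style bool query: full-text match restricted to one entity."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"body": terms}}],
                "filter": [{"term": {"entity_uris": entity_uri}}],
            }
        }
    }
```

With documents built this way, a search for 'wicket' can be narrowed to articles linked to the Apache project by passing that entity's URI to `entity_filtered_query`. For the `term` filter to match whole URIs, `entity_uris` would need to be indexed unanalyzed (a `keyword` field in ElasticSearch).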
