How do I do Entity Extraction in Lucene?


Question

I'm trying to do entity extraction (more like matching) in Lucene. Here is a sample workflow:

Given some text (from a URL) and a list of people's names, try to extract the names of people from the text.

Notes:

The names of people are not completely normalized. For example, some are "Mr. X" or "Mrs. Y", and some are just "John Doe", "X" and "Y". Other prefixes and suffixes to think about are Jr., Sr., Dr., I, II, etc. (don't get me started on non-US names).

I am using Lucene's MemoryIndex to create an in-memory index of the text from each URL (with HTML tags stripped), and StandardAnalyzer to query for every name in the list, one at a time. With 100k names, this takes about 8 seconds on average for the typical text I have. Is there any other way to do this?
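One possible alternative to running 100k queries per document is to invert the loop: index the name list once, then scan each text a single time and look up the token sequences that start at each position. The sketch below is plain Python rather than Lucene, and the helper names (`tokenize`, `build_index`, `find_names`) are made up for illustration. It assumes exact, case-insensitive token matches are acceptable.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and keep only runs of letters (punctuation is dropped)."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

def build_index(names):
    """Map the first token of each name to the set of full token tuples starting with it."""
    index = defaultdict(set)
    for name in names:
        tokens = tuple(tokenize(name))
        if tokens:
            index[tokens[0]].add(tokens)
    return index

def find_names(text, index):
    """Return the listed names (as token tuples) that occur as exact token sequences."""
    tokens = tokenize(text)
    found = set()
    for i, tok in enumerate(tokens):
        # Only names whose first token equals the current token are candidates.
        for cand in index.get(tok, ()):
            if tuple(tokens[i:i + len(cand)]) == cand:
                found.add(cand)
    return found

index = build_index(["John Doe", "Jane Roe", "Mr. X"])
matches = find_names("Yesterday John Doe met Mr. X in Paris.", index)
```

With this layout the index is built once and reused for every URL, and the work per document depends mainly on the length of the text and on how many names share a first token, not on the total size of the name list.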

A major problem is that, to eliminate noise, I'm using a score of 0.01 as a minimum threshold. If the text contains "John Doe", a query like "Mr. John Doe" scores significantly lower than the query "John Doe", and in many cases misses the 0.01 threshold.

The other problem is that if I normalize all names and start removing all occurrences of Dr., Mr., Mrs., etc., then I start missing good matches like "Dr. John Edward II" and end up with a lot of junk matches like "Mr. John Edward".
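A middle ground between keeping every qualifier and stripping them all might be to compare the core name and the qualifiers separately. The sketch below is only one way to do that, in plain Python rather than Lucene. The word lists are illustrative and incomplete, and the matching rules (generational suffixes must agree, titles may be missing on either side but must not conflict) are assumptions, not something stated in the question.

```python
import re

# Illustrative, incomplete lists; real data would need many more entries.
TITLES = {"mr", "mrs", "ms", "dr"}
GENERATIONAL = {"jr", "sr", "i", "ii", "iii"}
QUALIFIERS = TITLES | GENERATIONAL

def parts(name):
    """Split a name into (titles, core tokens, generational suffixes)."""
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", name)]
    titles = {t for t in tokens if t in TITLES}
    gens = {t for t in tokens if t in GENERATIONAL}
    core = tuple(t for t in tokens if t not in QUALIFIERS)
    return titles, core, gens

def same_person(listed, mention):
    """Decide whether a mention found in the text plausibly refers to a listed name."""
    l_titles, l_core, l_gens = parts(listed)
    m_titles, m_core, m_gens = parts(mention)
    if l_core != m_core:
        return False  # the core names differ
    if l_gens != m_gens:
        return False  # assumed rule: "II" vs. no suffix means a different person
    if l_titles and m_titles and l_titles != m_titles:
        return False  # assumed rule: conflicting titles such as Dr. vs. Mr.
    return True
```

Under these assumed rules, "John Edward II" in the text would still match the listed "Dr. John Edward II", while "Mr. John Edward" would be rejected because the generational suffix is missing.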

I understand that Lucene might not be the right tool for the job, but so far it hasn't proved to be too bad. Any help is appreciated.

Recommended answer

Named entity extraction (NEE) is an NLP task and is not part of Lucene. For open source options, you can look at LingPipe, GATE, and OpenNLP. There are also various commercial alternatives.

GATE is entirely rule-based, and will be hard to use for high precision. You'll need a statistical engine for that; LingPipe has one, but you have to supply the training data. I'm not up to date on what OpenNLP offers in this area.
