How do I do Entity Extraction in Lucene

Views: 120 Published: 2020/5/4 7:40:39 Tags: lucene named-entity-extraction

Problem description

I'm trying to do entity extraction (more like matching) in Lucene. Here is a sample workflow:

Given some text (from a URL) and a list of people's names, try to extract the names of people from the text.

Note:

Names of people are not completely normalized. For example, some are "Mr. X" or "Mrs. Y", and some are just "John Doe", "X" or "Y". Other prefixes and suffixes to think about are Jr., Sr., Dr., I, II, etc. (don't let me get started on non-US names).
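To make the prefix and suffix problem concrete, here is a minimal sketch in plain Java (no Lucene calls) that splits a raw name into an honorific, a core name, and a suffix. The class `NameParts` and its short word lists are hypothetical and only cover the examples mentioned above; real data would need far larger lists and special handling for non-US names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of Lucene): splits a raw name into
// an optional honorific prefix, the core name, and an optional suffix.
final class NameParts {
    // Illustrative lists only, taken from the examples in the question.
    private static final Set<String> PREFIXES =
            new HashSet<>(Arrays.asList("mr", "mrs", "dr"));
    private static final Set<String> SUFFIXES =
            new HashSet<>(Arrays.asList("jr", "sr", "i", "ii"));

    final String prefix;
    final String core;
    final String suffix;

    private NameParts(String prefix, String core, String suffix) {
        this.prefix = prefix;
        this.core = core;
        this.suffix = suffix;
    }

    static NameParts parse(String rawName) {
        List<String> tokens =
                new ArrayList<>(Arrays.asList(rawName.trim().split("\\s+")));
        String prefix = "";
        String suffix = "";
        // Only peel off a prefix or suffix if something is left for the core name.
        if (tokens.size() > 1 && PREFIXES.contains(key(tokens.get(0)))) {
            prefix = tokens.remove(0);
        }
        if (tokens.size() > 1 && SUFFIXES.contains(key(tokens.get(tokens.size() - 1)))) {
            suffix = tokens.remove(tokens.size() - 1);
        }
        return new NameParts(prefix, String.join(" ", tokens), suffix);
    }

    // Lowercase and drop periods/commas so that "Dr." and "dr" compare equal.
    private static String key(String token) {
        return token.toLowerCase(Locale.ROOT).replaceAll("[.,]", "");
    }

    public static void main(String[] args) {
        NameParts p = NameParts.parse("Dr. John Edward II");
        System.out.println(p.prefix + " | " + p.core + " | " + p.suffix);
    }
}
```

Keeping the three parts separate, rather than discarding the prefix and suffix, leaves the option of matching on the core name first and using the other parts later to tell similar names apart.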

I am using Lucene's MemoryIndex to create an in-memory index of the text from each URL (after stripping HTML tags), and I am using StandardAnalyzer to query for every name in the list, one at a time. With 100k names, this takes about 8 seconds on average for the typical text I have. Is there any other way to do this?

A major problem is that, to eliminate noise, I'm using a score of 0.01 as a minimum threshold. If the text contains "John Doe", a query like "Mr. John Doe" gets a significantly lower score than the query "John Doe", and in many cases it misses the 0.01 threshold.
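The effect described above can be illustrated with a deliberately simplified stand-in for relevance scoring: the fraction of query tokens that occur in the text. This is not Lucene's scoring formula, and the class `TermOverlap` is hypothetical; the sketch only shows why an extra unmatched token such as "Mr." pulls a score down.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration (not Lucene's scoring): scores a query by the
// fraction of its tokens that appear anywhere in the text.
final class TermOverlap {
    // Lowercase and split on anything that is not a letter or digit.
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        for (String t : s.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    static double matchedFraction(String query, String text) {
        Set<String> textTokens = new HashSet<>(tokenize(text));
        List<String> queryTokens = tokenize(query);
        if (queryTokens.isEmpty()) {
            return 0.0;
        }
        int matched = 0;
        for (String t : queryTokens) {
            if (textTokens.contains(t)) {
                matched++;
            }
        }
        return (double) matched / queryTokens.size();
    }

    public static void main(String[] args) {
        String text = "Yesterday John Doe spoke at the meeting.";
        // "mr" is absent from the text, so only 2 of 3 query tokens match.
        System.out.println(matchedFraction("Mr. John Doe", text));
        // Both query tokens match.
        System.out.println(matchedFraction("John Doe", text));
    }
}
```

With a fixed cut-off, the longer query is penalized for a token that carries no identifying information, which is one plausible reason "Mr. John Doe" falls below the threshold while "John Doe" does not.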

The other problem is that if I normalize all names and start removing all occurrences of Dr., Mr., Mrs., etc., then I start missing good matches like "Dr. John Edward II" and end up with a lot of junk matches like "Mr. John Edward".
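One reading of this problem is that, once titles are stripped, the remaining tokens of one name can sit entirely inside another name, so a plain phrase match can no longer tell the two people apart. The sketch below shows that collapse in plain Java; `TitleStripper` and its title list are hypothetical and limited to the titles named above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of the information lost by stripping titles.
final class TitleStripper {
    // Only the titles mentioned in the question; an assumption, not a full list.
    private static final Set<String> TITLES =
            new HashSet<>(Arrays.asList("dr", "mr", "mrs"));

    // Lowercase the name, drop periods/commas, and remove title tokens.
    static String strip(String name) {
        List<String> kept = new ArrayList<>();
        for (String raw : name.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT).replaceAll("[.,]", "");
            if (!token.isEmpty() && !TITLES.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    // True if the tokens of `phrase` occur as a contiguous run inside `text`.
    static boolean containsPhrase(String text, String phrase) {
        return (" " + text + " ").contains(" " + phrase + " ");
    }

    public static void main(String[] args) {
        String mention = strip("Dr. John Edward II"); // "john edward ii"
        String listed = strip("Mr. John Edward");     // "john edward"
        // After stripping, the listed name matches inside the other person's mention.
        System.out.println(containsPhrase(mention, listed));
    }
}
```

The same normalization that rescues "Mr. John Doe" therefore also lets "Mr. John Edward" match text that is actually about "Dr. John Edward II".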

I understand that Lucene might not be the right tool for the job either, but so far it hasn't proved to be too bad. Any help is appreciated.
