Dictionary-Based Named Entity Recognition with zero edit distance: LingPipe, Lucene or what?

Question

I'm trying to perform dictionary-based NER on some documents. My dictionary, regardless of the datatype, consists of key-value pairs of strings. I want to search the document for all the keys and return the corresponding value whenever a key matches.

The problem is that my dictionary is fairly large: ~7 million key-value pairs, with keys averaging 8 characters and values averaging 20 characters.

I've tried LingPipe with MapDictionary, but in my desired environment setup it runs out of memory after 200,000 rows are inserted. I don't clearly understand why LingPipe uses a map and not a hashmap in its algorithm.

I have no previous experience with Lucene, and I'd like to know whether it would make matching against a dictionary of this size feasible in a simpler way.

P.S. I've already tried chunking the data into several dictionaries and writing them to disk, but that is relatively slow.

Thanks for your help.

Cheers,
Parsa

Recommended answer

I suppose that if you want to reuse LingPipe's ExactDictionaryChunker to do the NER, you could override their MapDictionary so that it stores to and retrieves from a key/value database of your choice instead of their ObjectToSet (which does extend HashMap, by the way).
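To illustrate the general idea of exact (zero edit distance) matching against a pluggable key/value back end, here is a minimal sketch in plain Java. It does not use LingPipe's classes or reproduce its algorithm; `KeyValueStore`, `InMemoryStore`, `Match` and `ExactMatcher` are hypothetical names invented for this example, and `maxTokens` is an assumed upper bound on how many whitespace-separated tokens a key may span. Tokenization is deliberately naive (split on whitespace, punctuation stays attached), so treat it as a starting point rather than a finished chunker.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical lookup abstraction; a disk-backed key/value database could implement it. */
interface KeyValueStore {
    Optional<String> get(String key);
}

/** In-memory stand-in, used here only so the sketch runs without external libraries. */
final class InMemoryStore implements KeyValueStore {
    private final Map<String, String> map = new HashMap<>();

    void put(String key, String value) { map.put(key, value); }

    @Override
    public Optional<String> get(String key) { return Optional.ofNullable(map.get(key)); }
}

/** One exact match: character offsets [start, end) plus the matched key and its value. */
final class Match {
    final int start;
    final int end;
    final String key;
    final String value;

    Match(int start, int end, String key, String value) {
        this.start = start;
        this.end = end;
        this.key = key;
        this.value = value;
    }
}

/** Looks up every span of 1..maxTokens consecutive tokens in the store. */
final class ExactMatcher {
    private final KeyValueStore store;
    private final int maxTokens; // assumed longest key length, in tokens

    ExactMatcher(KeyValueStore store, int maxTokens) {
        this.store = store;
        this.maxTokens = maxTokens;
    }

    List<Match> findAll(String text) {
        // Tokens are maximal runs of non-whitespace characters, stored as {start, end}.
        List<int[]> tokens = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            if (i >= n) break;
            int tokenStart = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) i++;
            tokens.add(new int[] {tokenStart, i});
        }

        List<Match> matches = new ArrayList<>();
        for (int t = 0; t < tokens.size(); t++) {
            int limit = Math.min(tokens.size(), t + maxTokens);
            for (int u = t; u < limit; u++) {
                int s = tokens.get(t)[0];
                int e = tokens.get(u)[1];
                // The candidate keeps the original spacing, so only exact strings match.
                String candidate = text.substring(s, e);
                store.get(candidate)
                     .ifPresent(v -> matches.add(new Match(s, e, candidate, v)));
            }
        }
        return matches;
    }
}
```

Each candidate span costs one lookup, so the store is queried up to `maxTokens` times per token; with a slow back end that cost would dominate, which is worth measuring before committing to a particular database.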

Lucene/Solr can be used as a key/value store, but if you only need a pure lookup and not the extra search capabilities, other options might be a better fit for what you're doing.
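The answer does not say which other options it has in mind. As one possible illustration of a pure lookup, the sketch below keeps keys and values in two parallel sorted arrays and finds a key by binary search, which avoids the per-entry node objects a `HashMap` allocates. `SortedArrayDictionary` is a hypothetical class written for this example, and whether it fits 7 million entries in a given heap is something you would have to measure.

```java
import java.util.Arrays;

/** Read-only string-to-string lookup backed by two parallel sorted arrays. */
final class SortedArrayDictionary {
    private final String[] keys;
    private final String[] values;

    /** Each element of pairs is a two-element array: {key, value}. */
    SortedArrayDictionary(String[][] pairs) {
        String[][] sorted = pairs.clone();
        // Sort by key using String's natural ordering, the same ordering binarySearch uses.
        Arrays.sort(sorted, (a, b) -> a[0].compareTo(b[0]));
        keys = new String[sorted.length];
        values = new String[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            keys[i] = sorted[i][0];
            values[i] = sorted[i][1];
        }
    }

    /** Returns the value for an exactly matching key, or null if the key is absent. */
    String get(String key) {
        int idx = Arrays.binarySearch(keys, key);
        return idx >= 0 ? values[idx] : null;
    }
}
```

A lookup takes O(log n) string comparisons instead of a hash probe, and the structure cannot be updated cheaply after construction, so it only suits a dictionary that is built once and then queried.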
