Named Entity Recognition with Regular Expression: NLTK
Problem description
I have been playing with the NLTK toolkit. I run into this problem often and have searched online for a solution, but I have not found a satisfying answer, so I am posting my question here.
NER often fails to tag consecutive NNPs as a single NE. I think editing the NER to use RegexpTagger could also improve it.
Example:
Input:
Barack Obama is a great person.
Output:
Tree('S', [Tree('PERSON', [('Barack', 'NNP')]), Tree('ORGANIZATION', [('Obama', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.')])
whereas
Input:
Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he "was honored" to be compared to Darth Vader while in office.
Output:
Tree('S', [('Former', 'JJ'), ('Vice', 'NNP'), ('President', 'NNP'), Tree('NE', [('Dick', 'NNP'), ('Cheney', 'NNP')]), ('told', 'VBD'), ('conservative', 'JJ'), ('radio', 'NN'), ('host', 'NN'), Tree('NE', [('Laura', 'NNP'), ('Ingraham', 'NNP')]), ('that', 'IN'), ('he', 'PRP'), ('``', '``'), ('was', 'VBD'), ('honored', 'VBN'), ("''", "''"), ('to', 'TO'), ('be', 'VB'), ('compared', 'VBN'), ('to', 'TO'), Tree('NE', [('Darth', 'NNP'), ('Vader', 'NNP')]), ('while', 'IN'), ('in', 'IN'), ('office', 'NN'), ('.', '.')])
Here Vice/NNP, President/NNP, (Dick/NNP, Cheney/NNP) are correctly extracted.
So I think that if nltk.ne_chunk is run first and two consecutive subtrees then both consist of NNP tokens, there is a good chance that both refer to one entity.
Any suggestions would be really appreciated. I am looking for flaws in my approach.
Thanks.
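The regex idea from the question can be sketched with NLTK's RegexpParser, which chunks on patterns over POS tags. This swaps in RegexpParser for the RegexpTagger the question mentions (RegexpTagger assigns tags to words, it does not group tokens). The POS tags below are written by hand as an assumption so that the sketch needs no tagger model, and the chunk label NE is chosen only to match the output shown above.

```python
from nltk import RegexpParser
from nltk.tree import Tree

# Hand-written POS tags (an assumption) so the sketch needs no tagger model.
tagged = [("Barack", "NNP"), ("Obama", "NNP"), ("is", "VBZ"), ("a", "DT"),
          ("great", "JJ"), ("person", "NN"), (".", ".")]

# One chunk rule: any run of one or more NNP tags becomes a single NE chunk.
chunker = RegexpParser("NE: {<NNP>+}")
tree = chunker.parse(tagged)

# Collect the text of every chunk the rule produced.
entities = [" ".join(token for token, tag in subtree.leaves())
            for subtree in tree if isinstance(subtree, Tree)]
print(entities)  # ['Barack Obama']
```

One flaw in this rule: with the tags shown in the second example, Vice/NNP, President/NNP, Dick/NNP and Cheney/NNP are adjacent, so the same pattern would group all four tokens into a single chunk.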
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree


def get_continuous_chunks(text):
    # Tokenize, POS-tag, and run NLTK's named-entity chunker.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []  # merged entities, in order of first appearance
    current_chunk = []     # adjacent NE subtrees collected so far
    for node in chunked:
        if isinstance(node, Tree):
            # An NE subtree: join its tokens and keep collecting.
            current_chunk.append(" ".join(token for token, pos in node.leaves()))
        elif current_chunk:
            # A plain (token, tag) pair ends the current run of NE subtrees.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
    # Flush a run that reaches the end of the sentence.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
    return continuous_chunk


txt = "Barack Obama is a great person."
print(get_continuous_chunks(txt))
[out]:
['Barack Obama']
But do note that if the continuous chunks are not supposed to be a single NE, you would be combining multiple NEs into one. I can't think of such an example off the top of my head, but I'm sure it would happen. If the chunks are not continuous, the script above works fine:
>>> txt = "Barack Obama is the husband of Michelle Obama."
>>> get_continuous_chunks(txt)
['Barack Obama', 'Michelle Obama']
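One way to make the merging risk visible is to keep each subtree's label while merging, so that a merge across different labels can be inspected afterwards. The following is a minimal sketch under stated assumptions: merge_with_labels is a hypothetical helper, not part of NLTK, and the example tree is built by hand as a stand-in for ne_chunk output rather than taken from a real chunker run.

```python
from nltk.tree import Tree


def merge_with_labels(chunked):
    """Merge adjacent NE subtrees, keeping the labels that went into each merge."""
    merged = []  # list of (entity_text, [labels]) pairs
    run = []     # adjacent NE subtrees seen since the last plain token
    for node in list(chunked) + [None]:  # None is a sentinel that flushes the last run
        if isinstance(node, Tree):
            run.append(node)
            continue
        if run:
            text = " ".join(token for subtree in run for token, _ in subtree.leaves())
            labels = [subtree.label() for subtree in run]
            merged.append((text, labels))
            run = []
    return merged


# Hand-built stand-in for ne_chunk output (an assumption, not real chunker output).
example = Tree("S", [
    ("She", "PRP"), ("sent", "VBD"),
    Tree("PERSON", [("Mary", "NNP")]),
    Tree("GPE", [("London", "NNP")]),
    ("tickets", "NNS"), (".", "."),
])

for text, labels in merge_with_labels(example):
    flag = "check" if len(set(labels)) > 1 else "ok"
    print(text, labels, flag)  # Mary London ['PERSON', 'GPE'] check
```

Mixed labels are only a hint to look closer, not proof of a bad merge: the Barack/PERSON plus Obama/ORGANIZATION output from the question would be flagged too, even though merging is correct there.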