Named Entity Recognition with Regular Expression: NLTK

Views: 179 Published: 2020/5/18 0:33:56 Tags: regex nlp nltk named-entity-recognition


Problem description

I have been playing with the NLTK toolkit. I run into this problem a lot and have searched for a solution online, but I could not find a satisfying answer anywhere, so I am putting my query here.

Many times the NER does not tag consecutive NNPs as one NE. I think editing the NER to use RegexpTagger could also improve it.

Example:

Input:

Barack Obama is a great person.

Output:

Tree('S', [Tree('PERSON', [('Barack', 'NNP')]), Tree('ORGANIZATION', [('Obama', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.')])

whereas

Input:

Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he "was honored" to be compared to Darth Vader while in office.

Output:

Tree('S', [('Former', 'JJ'), ('Vice', 'NNP'), ('President', 'NNP'), Tree('NE', [('Dick', 'NNP'), ('Cheney', 'NNP')]), ('told', 'VBD'), ('conservative', 'JJ'), ('radio', 'NN'), ('host', 'NN'), Tree('NE', [('Laura', 'NNP'), ('Ingraham', 'NNP')]), ('that', 'IN'), ('he', 'PRP'), ('``', '``'), ('was', 'VBD'), ('honored', 'VBN'), ("''", "''"), ('to', 'TO'), ('be', 'VB'), ('compared', 'VBN'), ('to', 'TO'), Tree('NE', [('Darth', 'NNP'), ('Vader', 'NNP')]), ('while', 'IN'), ('in', 'IN'), ('office', 'NN'), ('.', '.')])

Here Vice/NNP, President/NNP, (Dick/NNP, Cheney/NNP) is correctly extracted.

So I think that if nltk.ne_chunk is used first, and two consecutive trees are then both NNP, there is a high chance that both refer to one entity.
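For comparison, here is a minimal sketch of the simpler rule: joining every run of consecutive NNP tokens without running ne_chunk first. It works on plain (token, tag) tuples copied from the tagged output above, so it runs without NLTK. The function name group_nnp_runs is made up for this illustration and is not part of NLTK.

```python
from itertools import groupby

def group_nnp_runs(tagged_tokens):
    """Join each run of consecutive NNP-tagged tokens into one candidate string."""
    candidates = []
    # groupby splits the sequence into maximal runs that share the same key value.
    for is_nnp, run in groupby(tagged_tokens, key=lambda pair: pair[1] == "NNP"):
        if is_nnp:
            candidates.append(" ".join(token for token, _tag in run))
    return candidates

# Tags copied from the first part of the Dick Cheney sentence above.
tagged = [("Former", "JJ"), ("Vice", "NNP"), ("President", "NNP"),
          ("Dick", "NNP"), ("Cheney", "NNP"), ("told", "VBD"),
          ("conservative", "JJ"), ("radio", "NN"), ("host", "NN"),
          ("Laura", "NNP"), ("Ingraham", "NNP"), ("that", "IN")]

print(group_nnp_runs(tagged))
# ['Vice President Dick Cheney', 'Laura Ingraham']
```

On this sentence the rule folds the title into the name ("Vice President Dick Cheney"), which is one reason merging only adjacent NE trees from ne_chunk may be the safer variant.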

Any suggestions would be really appreciated. I am looking for flaws in my approach.

Thanks.

Recommended answer

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    # Tokenize, POS-tag, then NE-chunk the text.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []  # merged named entities, in order of first appearance
    current_chunk = []     # text of the adjacent NE subtrees seen so far

    for i in chunked:
        if isinstance(i, Tree):
            # An NE subtree: collect its tokens and keep extending the current run.
            current_chunk.append(" ".join(token for token, pos in i.leaves()))
        elif current_chunk:
            # A plain token ends the run: store the merged entity once.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Reset even when the entity was a duplicate.
            current_chunk = []

    # Flush a run that reaches the end of the text.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk

txt = "Barack Obama is a great person."
print(get_continuous_chunks(txt))

[out]:

['Barack Obama']

But do note that if the continuous chunks are not supposed to be a single NE, you would be combining multiple NEs into one. I can't think of such an example off the top of my head, but I'm sure it would happen. If they are not continuous, the script above works fine:
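One possible way to spot merges that may be wrong is to keep the chunk labels next to the merged text and flag any entity built from differently labelled chunks. The sketch below is an illustration only: Chunk is a hypothetical stand-in for an NE subtree so that the example runs without NLTK, and merge_adjacent_chunks is a made-up name, not an NLTK function.

```python
from typing import NamedTuple

class Chunk(NamedTuple):
    """Hypothetical stand-in for an NE subtree: a label plus its (token, tag) leaves."""
    label: str
    leaves: list

def merge_adjacent_chunks(items):
    """Merge neighbouring Chunk items and return (text, labels) pairs."""
    merged = []
    current_tokens, current_labels = [], []
    for item in items:
        if isinstance(item, Chunk):
            # Extend the current run with this chunk's tokens and remember its label.
            current_tokens.extend(token for token, _tag in item.leaves)
            current_labels.append(item.label)
        elif current_tokens:
            # A plain (token, tag) pair ends the run.
            merged.append((" ".join(current_tokens), current_labels))
            current_tokens, current_labels = [], []
    if current_tokens:
        # Flush a run that reaches the end of the sentence.
        merged.append((" ".join(current_tokens), current_labels))
    return merged

# Mirrors the first output in the question: PERSON followed directly by ORGANIZATION.
sentence = [Chunk("PERSON", [("Barack", "NNP")]),
            Chunk("ORGANIZATION", [("Obama", "NNP")]),
            ("is", "VBZ"), ("a", "DT"), ("great", "JJ"),
            ("person", "NN"), (".", ".")]

for text, labels in merge_adjacent_chunks(sentence):
    # Differing labels inside one merged entity suggest the merge deserves a second look.
    flag = "check" if len(set(labels)) > 1 else "ok"
    print(text, labels, flag)
```

For the Barack Obama sentence this reports one merged entity with labels PERSON and ORGANIZATION, marked "check". Whether such a merge is right still needs a human or a further rule to decide.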

>>> txt = "Barack Obama is the husband of Michelle Obama."  
>>> get_continuous_chunks(txt)
['Barack Obama', 'Michelle Obama']
