Chunking Stanford Named Entity Recognizer (NER) outputs from NLTK format


Question

I am using NER in NLTK to find persons, locations, and organizations in sentences. I am able to produce results like this:

[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]

Is it possible to chunk consecutive tokens that share a tag together? What I want is something like this:

u'Remaking'/ u'O', u'The'/u'O', (u'Republican', u'Party')/u'ORGANIZATION'

Thanks!

Recommended answer

You can use the standard NLTK way of representing chunks, which is nltk.Tree. This might mean that you have to change your representation a bit.

What I usually do is represent NER-tagged sentences as lists of (word, POS tag, NER tag) triplets:

sentence = [('Andrew', 'NNP', 'PERSON'), ('is', 'VBZ', 'O'), ('part', 'NN', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('Republican', 'NNP', 'ORGANIZATION'), ('Party', 'NNP', 'ORGANIZATION'), ('in', 'IN', 'O'), ('Dallas', 'NNP', 'LOCATION')]
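
If the tagger only returns (word, NER tag) pairs, as in the question, one way to build such triplets is to pair them with a POS-tagged copy of the same tokens. The following is a minimal sketch under that assumption: `to_triplets`, `pos_tagged`, and `ner_tagged` are hypothetical names, and the two lists are assumed to come from taggers that were run on the same tokenization.

```python
# Hypothetical inputs: the same tokens tagged once for POS and once for NER.
pos_tagged = [('Republican', 'NNP'), ('Party', 'NNP'), ('in', 'IN'), ('Dallas', 'NNP')]
ner_tagged = [('Republican', 'ORGANIZATION'), ('Party', 'ORGANIZATION'),
              ('in', 'O'), ('Dallas', 'LOCATION')]


def to_triplets(pos_tagged, ner_tagged):
    """Merge (word, POS) and (word, NER) pairs into (word, POS, NER) triplets."""
    if len(pos_tagged) != len(ner_tagged):
        raise ValueError('the two taggers produced different token counts')
    triplets = []
    for (word, pos), (ner_word, ner) in zip(pos_tagged, ner_tagged):
        # Both taggers must have seen the same token at this position.
        if word != ner_word:
            raise ValueError('token mismatch: %r vs %r' % (word, ner_word))
        triplets.append((word, pos, ner))
    return triplets


triplets = to_triplets(pos_tagged, ner_tagged)
print(triplets)
```

The explicit length and token checks are there because a silent misalignment between the two taggers would attach tags to the wrong words.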

I do this when I use an external tool for NER tagging a sentence. Now you can transform this sentence into the NLTK representation:

from nltk import Tree


def IOB_to_tree(iob_tagged):
    """Group consecutive tokens that share an entity tag into subtrees."""
    root = Tree('S', [])
    for word, pos, ner in iob_tagged:
        if ner == 'O':
            # Tokens outside any entity stay as plain (word, POS) leaves.
            root.append((word, pos))
        elif root and isinstance(root[-1], Tree) and root[-1].label() == ner:
            # Same tag as the previous chunk: extend that chunk.
            root[-1].append((word, pos))
        else:
            # Otherwise start a new chunk labelled with the entity tag.
            root.append(Tree(ner, [(word, pos)]))
    return root


sentence = [('Andrew', 'NNP', 'PERSON'), ('is', 'VBZ', 'O'), ('part', 'NN', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('Republican', 'NNP', 'ORGANIZATION'), ('Party', 'NNP', 'ORGANIZATION'), ('in', 'IN', 'O'), ('Dallas', 'NNP', 'LOCATION')]
print(IOB_to_tree(sentence))

The change in representation makes sense because you need POS tags for NER tagging anyway.

The end result should look like this:

(S
  (PERSON Andrew/NNP)
  is/VBZ
  part/NN
  of/IN
  the/DT
  (ORGANIZATION Republican/NNP Party/NNP)
  in/IN
  (LOCATION Dallas/NNP))
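
If POS tags are not available and a flat list close to the one in the question is enough, the same grouping idea can be sketched with only the standard library. This is an alternative to the nltk.Tree approach, not part of it: `chunk_pairs` is a hypothetical helper that uses `itertools.groupby` to merge runs of consecutive tokens carrying the same non-'O' tag.

```python
from itertools import groupby


def chunk_pairs(tagged):
    """Merge runs of consecutive (word, tag) pairs that share a non-'O' tag."""
    chunks = []
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag == 'O':
            # Untagged tokens are kept one by one.
            chunks.extend((word, 'O') for word in words)
        else:
            # A run of identically tagged tokens becomes a single chunk.
            chunks.append((tuple(words), tag))
    return chunks


tagged = [('Remaking', 'O'), ('The', 'O'),
          ('Republican', 'ORGANIZATION'), ('Party', 'ORGANIZATION')]
chunked = chunk_pairs(tagged)
print(chunked)
# [('Remaking', 'O'), ('The', 'O'), (('Republican', 'Party'), 'ORGANIZATION')]
```

Like the tree version, this merges any two adjacent entities of the same type into one chunk, because plain tags without B-/I- prefixes carry no boundary information.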
