Chunking Stanford Named Entity Recognizer (NER) outputs from NLTK format

Question

I am using NER in NLTK to find persons, locations, and organizations in sentences. I am able to produce results like this:

[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]

Is it possible to chunk adjacent tokens together from this output? What I want is something like this:

u'Remaking'/ u'O', u'The'/u'O', (u'Republican', u'Party')/u'ORGANIZATION'

Thanks!

Recommended answer

You can use the standard NLTK way of representing chunks, nltk.Tree. This might mean that you have to change your representation a bit.

What I usually do is represent an NER-tagged sentence as a list of (word, POS tag, NER tag) triplets:

sentence = [('Andrew', 'NNP', 'PERSON'), ('is', 'VBZ', 'O'), ('part', 'NN', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('Republican', 'NNP', 'ORGANIZATION'), ('Party', 'NNP', 'ORGANIZATION'), ('in', 'IN', 'O'), ('Dallas', 'NNP', 'LOCATION')]
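If your tools give you POS tags and NER tags as two separate lists of pairs, triplets like the ones above could be assembled by zipping the two lists. The following is only a minimal sketch: merge_pos_and_ner is a hypothetical helper name, the two input lists are hand-written stand-ins for real tagger output, and it assumes both taggers saw exactly the same tokenization.

```python
def merge_pos_and_ner(pos_tagged, ner_tagged):
    """Combine (word, POS) and (word, NER) pairs into (word, POS, NER) triplets.

    Hypothetical helper; assumes both lists cover the same tokens in order.
    """
    if len(pos_tagged) != len(ner_tagged):
        raise ValueError("both taggers must see the same tokens")
    triplets = []
    for (word, pos), (ner_word, ner) in zip(pos_tagged, ner_tagged):
        if word != ner_word:
            raise ValueError("token mismatch: %r vs %r" % (word, ner_word))
        triplets.append((word, pos, ner))
    return triplets


# Hand-written stand-ins for real tagger output.
pos_tagged = [('Republican', 'NNP'), ('Party', 'NNP'), ('in', 'IN'), ('Dallas', 'NNP')]
ner_tagged = [('Republican', 'ORGANIZATION'), ('Party', 'ORGANIZATION'),
              ('in', 'O'), ('Dallas', 'LOCATION')]
print(merge_pos_and_ner(pos_tagged, ner_tagged))
```

The explicit token check is a design choice: if the two taggers tokenize differently, failing early is usually easier to debug than silently misaligned tags.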

I use this triplet representation whenever an external tool does the NER tagging. You can then convert such a sentence into the NLTK representation:

from nltk import Tree


def IOB_to_tree(iob_tagged):
    """Group consecutive tokens that share an NER tag into nltk.Tree chunks."""
    root = Tree('S', [])
    for word, pos, ner in iob_tagged:
        if ner == 'O':
            # Tokens outside any entity stay as plain (word, POS) leaves.
            root.append((word, pos))
        elif root and isinstance(root[-1], Tree) and root[-1].label() == ner:
            # Same tag as the chunk immediately before: extend that chunk.
            root[-1].append((word, pos))
        else:
            # Otherwise start a new chunk labelled with the NER tag.
            root.append(Tree(ner, [(word, pos)]))
    return root


sentence = [('Andrew', 'NNP', 'PERSON'), ('is', 'VBZ', 'O'), ('part', 'NN', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('Republican', 'NNP', 'ORGANIZATION'), ('Party', 'NNP', 'ORGANIZATION'), ('in', 'IN', 'O'), ('Dallas', 'NNP', 'LOCATION')]
print(IOB_to_tree(sentence))

The change in representation makes sense, because you need POS tags for NER tagging anyway.

The end result should look like this:

(S
  (PERSON Andrew/NNP)
  is/VBZ
  part/NN
  of/IN
  the/DT
  (ORGANIZATION Republican/NNP Party/NNP)
  in/IN
  (LOCATION Dallas/NNP))
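If you would rather stay with the plain (word, tag) pairs from the question and skip POS tags and nltk.Tree altogether, the same grouping idea can be sketched with itertools.groupby from the standard library. This is an alternative to the tree approach, not part of it; chunk_ner_pairs is a hypothetical name, and the output shape (a tuple of words paired with the tag) is an assumption based on the format the question asks for.

```python
from itertools import groupby


def chunk_ner_pairs(ner_tagged):
    """Merge runs of consecutive tokens that share a non-'O' NER tag.

    Hypothetical helper: returns 'O' tokens unchanged as (word, 'O') and
    each entity run as ((word1, word2, ...), tag).
    """
    chunks = []
    # groupby yields one group per run of adjacent pairs with the same tag.
    for tag, group in groupby(ner_tagged, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag == 'O':
            # Non-entity tokens are kept one by one.
            chunks.extend((word, tag) for word in words)
        else:
            # An entity run becomes a single chunk.
            chunks.append((tuple(words), tag))
    return chunks


tagged = [(u'Remaking', u'O'), (u'The', u'O'),
          (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]
print(chunk_ner_pairs(tagged))
```

One caveat applies to this sketch and to the tree conversion alike: the tags shown here carry no B-/I- prefixes, so two different entities of the same type that sit directly next to each other end up merged into one chunk.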
