Chunking Stanford Named Entity Recognizer (NER) outputs from NLTK format
Problem description
I am using NER in NLTK to find persons, locations, and organizations in sentences. I am able to produce results like this:
[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]
Is it possible to chunk adjacent tokens with the same tag together? What I want is something like this:
u'Remaking'/ u'O', u'The'/u'O', (u'Republican', u'Party')/u'ORGANIZATION'
Thanks!
Recommended answer
You can use the standard NLTK way of representing chunks, which is nltk.Tree. This might mean that you have to change your representation a bit.
What I usually do is represent NER-tagged sentences as lists of (word, POS tag, NER tag) triplets:
sentence = [('Andrew', 'NNP', 'PERSON'), ('is', 'VBZ', 'O'), ('part', 'NN', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('Republican', 'NNP', 'ORGANIZATION'), ('Party', 'NNP', 'ORGANIZATION'), ('in', 'IN', 'O'), ('Dallas', 'NNP', 'LOCATION')]
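If the POS tags and the NER tags come from two separate taggers, the triplets can be assembled by zipping the two outputs, provided both taggers saw the same tokenization. A minimal sketch under that assumption; the helper name `to_triplets` and the two input lists are illustrative and not part of NLTK:

```python
def to_triplets(pos_tagged, ner_tagged):
    """Merge (word, POS) and (word, NER) pairs into (word, POS, NER) triplets."""
    if len(pos_tagged) != len(ner_tagged):
        raise ValueError("POS and NER outputs have different lengths")
    triplets = []
    for (word, pos), (ner_word, ner) in zip(pos_tagged, ner_tagged):
        # Both taggers must agree on the tokens, otherwise the tags would be misaligned.
        if word != ner_word:
            raise ValueError("token mismatch: %r vs %r" % (word, ner_word))
        triplets.append((word, pos, ner))
    return triplets


# Hand-written sample inputs standing in for real tagger output.
pos_tagged = [('Republican', 'NNP'), ('Party', 'NNP'), ('in', 'IN'), ('Dallas', 'NNP')]
ner_tagged = [('Republican', 'ORGANIZATION'), ('Party', 'ORGANIZATION'),
              ('in', 'O'), ('Dallas', 'LOCATION')]
print(to_triplets(pos_tagged, ner_tagged))
```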
I do this when I use an external tool to NER-tag a sentence. Now you can transform this sentence into the NLTK representation:
from nltk import Tree

def IOB_to_tree(iob_tagged):
    # Build a flat tree; adjacent tokens with the same entity label share one subtree.
    root = Tree('S', [])
    for word, pos, ner in iob_tagged:
        if ner == 'O':
            # Tokens outside any entity are attached directly to the root.
            root.append((word, pos))
        elif len(root) > 0 and isinstance(root[-1], Tree) and root[-1].label() == ner:
            # Same label as the previous chunk: extend that chunk.
            root[-1].append((word, pos))
        else:
            # Otherwise start a new chunk for this entity label.
            root.append(Tree(ner, [(word, pos)]))
    return root

sentence = [('Andrew', 'NNP', 'PERSON'), ('is', 'VBZ', 'O'), ('part', 'NN', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('Republican', 'NNP', 'ORGANIZATION'), ('Party', 'NNP', 'ORGANIZATION'), ('in', 'IN', 'O'), ('Dallas', 'NNP', 'LOCATION')]
print(IOB_to_tree(sentence))
The change in representation kind of makes sense, because you need POS tags for NER tagging anyway.
The final result should look like this:
(S
  (PERSON Andrew/NNP)
  is/VBZ
  part/NN
  of/IN
  the/DT
  (ORGANIZATION Republican/NNP Party/NNP)
  in/IN
  (LOCATION Dallas/NNP))
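If the POS tags are not needed, the (word, tag) pairs from the question can also be grouped directly, without building a tree. The following is a sketch using `itertools.groupby` from the standard library; the helper name `chunk_pairs` is hypothetical. Like the tree version, it merges any run of adjacent tokens that carry the same label, so two distinct entities of the same type standing next to each other end up in one chunk.

```python
from itertools import groupby


def chunk_pairs(tagged):
    """Group consecutive (word, tag) pairs that share the same non-'O' tag."""
    chunks = []
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag == 'O':
            # Tokens outside any entity stay as individual (word, 'O') items.
            chunks.extend((word, tag) for word in words)
        else:
            # A run of entity tokens becomes one (tuple_of_words, tag) item.
            chunks.append((tuple(words), tag))
    return chunks


tagged = [('Remaking', 'O'), ('The', 'O'),
          ('Republican', 'ORGANIZATION'), ('Party', 'ORGANIZATION')]
print(chunk_pairs(tagged))
# [('Remaking', 'O'), ('The', 'O'), (('Republican', 'Party'), 'ORGANIZATION')]
```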