How to feed CoreNLP some pre-labeled Named Entities?

Problem description

I want to use Stanford CoreNLP to pull out coreferences and start working on the dependencies of pre-labeled text. I eventually hope to build graph nodes and edges between related named entities. I am working in Python, but using nltk's Java functions to call the "edu.stanford.nlp.pipeline.StanfordCoreNLP" jar directly (which is what nltk does behind the scenes anyway).

My pre-labeled text is in this format:

PRE-LABELED:  During his youth, [PERSON: Alexander III of Macedon] was tutored by [PERSON: Aristotle] until age 16.  Following the conquest of [LOCATION: Anatolia], [PERSON: Alexander] broke the power of [LOCATION: Persia] in a series of decisive battles, most notably the battles of [LOCATION: Issus] and [LOCATION: Gaugamela].  He subsequently overthrew [PERSON: Persian King Darius III] and conquered the [ORGANIZATION: Achaemenid Empire] in its entirety.

What I tried to do is tokenize my sentences myself, building a list of tuples in IOB format: [ ("During","O"), ("his","O"), ("youth","O"), ("Alexander","B-PERSON"), ("III","I-PERSON"), ...]
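
The question does not show how the IOB list was built. A minimal sketch of one way to do it is below; the `to_iob` and `tokenize` names and the naive regex tokenizer are illustrative assumptions, not the asker's code, and the tokenizer will not necessarily split text the way CoreNLP does.

```python
import re

# Matches spans such as "[PERSON: Alexander III of Macedon]".
LABEL_RE = re.compile(r"\[(?P<tag>[A-Za-z]+):\s*(?P<ner>[^\]]+?)\s*\]")

def tokenize(s):
    # Naive tokenizer (assumption): runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", s)

def to_iob(text):
    """Return a list of (token, IOB-tag) tuples for bracket-labeled text."""
    pairs = []
    pos = 0
    for m in LABEL_RE.finditer(text):
        # Unlabeled text before this span gets the "O" tag.
        pairs += [(tok, "O") for tok in tokenize(text[pos:m.start()])]
        # First token of the entity is B-<TAG>, the rest are I-<TAG>.
        for i, tok in enumerate(tokenize(m.group("ner"))):
            prefix = "B" if i == 0 else "I"
            pairs.append((tok, prefix + "-" + m.group("tag")))
        pos = m.end()
    # Trailing unlabeled text.
    pairs += [(tok, "O") for tok in tokenize(text[pos:])]
    return pairs
```

Unlike the list shown in the question, this sketch also emits punctuation as separate "O" tokens.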

However, I can't figure out how to tell CoreNLP to take this tuple list as a starting point, building additional named entities that weren't initially labeled and finding coreferences on these new, higher-quality tokenized sentences. I obviously tried simply stripping out my labels and letting CoreNLP do this by itself, but CoreNLP is just not as good at finding the named entities as the human-tagged pre-labeled text.

I need an output like the one below. I understand that it will be difficult to use dependencies to get edges in this way, but I need to see how far I can get.

DESIRED OUTPUT:
[Person 1]:
Name: Alexander III of Macedon
Mentions:
* "Alexander III of Macedon"; Sent1 [4,5,6,7] # List of tokens
* "Alexander"; Sent2 [6]
* "He"; Sent3 [1]
Edges:
* "Person 2"; "tutored by"; "Aristotle"

[Person 2]:
Name: Aristotle
[....]

How can I feed CoreNLP some pre-defined named entities and still get its help with additional named entities, coreference, and basic dependencies?

P.S. Note that this is not a duplicate of "NLTK Named Entity Recognition with Custom Data". I'm not trying to train a new classifier with my pre-labeled NER; I'm only trying to add CoreNLP's entities to my own when running coreference (including mentions) and dependencies on a given sentence.

Recommended answer

The answer is to make a rules file with additional TokensRegexNER rules.

I used a regex to group out the labeled names. From this I built a temporary rules file, which I passed to the CoreNLP jar with -ner.additional.regexner.mapping mytemprulesfile.

Alexander III of Macedon    PERSON      PERSON,LOCATION,ORGANIZATION,MISC
Aristotle                   PERSON      PERSON,LOCATION,ORGANIZATION,MISC
Anatolia                    LOCATION    PERSON,LOCATION,ORGANIZATION,MISC
Alexander                   PERSON      PERSON,LOCATION,ORGANIZATION,MISC
Persia                      LOCATION    PERSON,LOCATION,ORGANIZATION,MISC
Issus                       LOCATION    PERSON,LOCATION,ORGANIZATION,MISC
Gaugamela                   LOCATION    PERSON,LOCATION,ORGANIZATION,MISC
Persian King Darius III     PERSON      PERSON,LOCATION,ORGANIZATION,MISC
Achaemenid Empire           ORGANIZATION    PERSON,LOCATION,ORGANIZATION,MISC

I've aligned this list for readability, but the actual file contains tab-separated values.
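
As a rough illustration of that layout, the sketch below builds the rule lines from (text, tag) pairs, with the entity text, the label, and the list of overwritable labels joined by tabs. The `build_regexner_rules` name and the de-duplication step are assumptions added here; the answer's own generator appears in the full example further down.

```python
def build_regexner_rules(entities,
                         overwritable=("PERSON", "LOCATION", "ORGANIZATION", "MISC")):
    """Build tab-separated rule lines from an iterable of (text, tag) pairs."""
    seen = set()
    lines = []
    for text, tag in entities:
        # Skip repeated entities so each rule is written only once (assumption:
        # duplicate rules add nothing useful).
        if (text, tag) in seen:
            continue
        seen.add((text, tag))
        # Columns: pattern, label, labels this rule may overwrite.
        lines.append("\t".join([text, tag, ",".join(overwritable)]))
    return "".join(line + "\n" for line in lines)
```

For example, `build_regexner_rules([("Aristotle", "PERSON"), ("Persia", "LOCATION")])` yields two lines in the same shape as the table above.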

An interesting finding is that some multi-word pre-labeled entities stay multi-word as originally labeled, whereas running CoreNLP without the rules file will sometimes split these tokens into separate entities.

I had wanted to specifically identify the named-entity tokens, figuring it would make coreferences easier, but I guess this will do for now. How often are entity names identical but unrelated within one document, anyway?
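
If identifying the exact labeled spans does turn out to matter, one possible approach is to record each entity's character offsets in the cleaned text while stripping the labels. The sketch below does that; `strip_labels_with_offsets` is a hypothetical helper, not part of the answer. The offsets could then be compared against the character offsets that CoreNLP reports in its JSON output (I believe these are the `characterOffsetBegin` and `characterOffsetEnd` fields, but that is an assumption to check against your own output).

```python
import re

# Matches spans such as "[PERSON: Aristotle]".
LABEL_RE = re.compile(r"\[(?P<tag>[A-Za-z]+):\s*(?P<ner>[^\]]+?)\s*\]")

def strip_labels_with_offsets(text):
    """Return (cleantext, spans); spans are (start, end, tag) offsets into cleantext."""
    parts, spans = [], []
    pos = 0      # read position in the labeled text
    out_len = 0  # length of the cleaned text built so far
    for m in LABEL_RE.finditer(text):
        before = text[pos:m.start()]
        parts.append(before)
        out_len += len(before)
        name = m.group("ner")
        # Record where the bare entity name lands in the cleaned text.
        spans.append((out_len, out_len + len(name), m.group("tag")))
        parts.append(name)
        out_len += len(name)
        pos = m.end()
    parts.append(text[pos:])
    return "".join(parts), spans
```

Each span satisfies `cleantext[start:end] == entity name`, so two identical names at different positions stay distinguishable.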

Example (takes about 70 seconds to run)

import os, re, tempfile, json, nltk, pprint
from subprocess import PIPE
from nltk.internals import (
    find_jar_iter,
    config_java,
    java,
    _java_options,
    find_jars_within_path,
)

def ExtractLabeledEntitiesByRegex( text, regex ):
    rgx = re.compile(regex)
    nelist = []
    for mobj in rgx.finditer( text ):
        ne = mobj.group('ner')
        try:
            tag = mobj.group('tag')
        except IndexError:
            tag = 'PERSON'
        mstr = text[mobj.start():mobj.end()]
        nelist.append( (ne,tag,mstr) )
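    # Raw string so that \g<ner> is passed to re as a group reference, not treated as a string escape.
    cleantext = rgx.sub(r"\g<ner>", text)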
    return (nelist, cleantext)

def GenerateTokensNERRules( nelist ):
    rules = ""
    for ne in nelist:
        rules += ne[0]+'\t'+ne[1]+'\tPERSON,LOCATION,ORGANIZATION,MISC\n'
    return rules

def GetEntities( origtext ):
    # Raw string avoids invalid-escape warnings for \[, \s, \w in the pattern.
    nelist, cleantext = ExtractLabeledEntitiesByRegex( origtext, r'(\[(?P<tag>[a-zA-Z]+)\:\s*)(?P<ner>(\s*\w)+)(\s*\])' )

    origfile = tempfile.NamedTemporaryFile(mode='r+b', delete=False)
    origfile.write( cleantext.encode('utf-8') )
    origfile.flush()
    origfile.seek(0)
    nerrulefile = tempfile.NamedTemporaryFile(mode='r+b', delete=False)
    nerrulefile.write( GenerateTokensNERRules(nelist).encode('utf-8') )
    nerrulefile.flush()
    nerrulefile.seek(0)

    java_options='-mx4g'
    config_java(options=java_options, verbose=True)
    stanford_jar = '../stanford-corenlp-full-2018-10-05/stanford-corenlp-3.9.2.jar'
    stanford_dir = os.path.split(stanford_jar)[0]
    _classpath = tuple(find_jars_within_path(stanford_dir))

    cmd = ['edu.stanford.nlp.pipeline.StanfordCoreNLP',
        '-annotators','tokenize,ssplit,pos,lemma,ner,parse,coref,coref.mention,depparse,natlog,openie,relation',
        '-ner.combinationMode','HIGH_RECALL',
        '-ner.additional.regexner.mapping',nerrulefile.name,
        '-coref.algorithm','neural',
        '-outputFormat','json',
        '-file',origfile.name
        ]

    # java( cmd, classpath=_classpath, stdout=PIPE, stderr=PIPE )
    stdout, stderr = java( cmd, classpath=_classpath, stdout=PIPE, stderr=PIPE )    # Couldn't get working- stdin=textfile
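    # PrintJavaOutput is not defined anywhere in this post; print CoreNLP's output directly instead.
    print( stdout.decode('utf-8', errors='replace') )
    print( stderr.decode('utf-8', errors='replace') )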

    # CoreNLP names its JSON output after the input file; only the base name is needed here.
    jsonfilename = os.path.basename(origfile.name) + '.json'

    os.unlink( origfile.name )
    os.unlink( nerrulefile.name )
    origfile.close()
    nerrulefile.close()

    with open( jsonfilename ) as jsonfile:
        jsondata = json.load(jsonfile)

    currentid = 0
    entities = []
    for sent in jsondata['sentences']:
        for thisentity in sent['entitymentions']:
            tag = thisentity['ner']
            if tag == 'PERSON' or tag == 'LOCATION' or tag == 'ORGANIZATION':
                entity = {
                    'id':currentid,
                    'label':thisentity['text'],
                    'tag':tag
                }
                entities.append( entity )
                currentid += 1

    return entities

#### RUN ####
corpustext = "During his youth, [PERSON:Alexander III of Macedon] was tutored by [PERSON: Aristotle] until age 16.  Following the conquest of [LOCATION: Anatolia], [PERSON: Alexander] broke the power of [LOCATION: Persia] in a series of decisive battles, most notably the battles of [LOCATION: Issus] and [LOCATION: Gaugamela].  He subsequently overthrew [PERSON: Persian King Darius III] and conquered the [ORGANIZATION: Achaemenid Empire] in its entirety."

entities = GetEntities( corpustext )
for thisent in entities:
    pprint.pprint( thisent )

Output

{'id': 0, 'label': 'Alexander III of Macedon', 'tag': 'PERSON'}
{'id': 1, 'label': 'Aristotle', 'tag': 'PERSON'}
{'id': 2, 'label': 'his', 'tag': 'PERSON'}
{'id': 3, 'label': 'Anatolia', 'tag': 'LOCATION'}
{'id': 4, 'label': 'Alexander', 'tag': 'PERSON'}
{'id': 5, 'label': 'Persia', 'tag': 'LOCATION'}
{'id': 6, 'label': 'Issus', 'tag': 'LOCATION'}
{'id': 7, 'label': 'Gaugamela', 'tag': 'LOCATION'}
{'id': 8, 'label': 'Persian King Darius III', 'tag': 'PERSON'}
{'id': 9, 'label': 'Achaemenid Empire', 'tag': 'ORGANIZATION'}
{'id': 10, 'label': 'He', 'tag': 'PERSON'}
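
The output above lists entity mentions only, and the pronouns 'his' and 'He' are not yet linked to a person. To move toward the grouped "Mentions" layout the question asks for, the coreference chains in the same JSON could be collected roughly as sketched below. The field names used here (`corefs`, `text`, `sentNum`, `startIndex`, `endIndex`, `isRepresentativeMention`) and the exclusive `endIndex` are my assumptions about CoreNLP's JSON layout and should be checked against the actual file; `group_mentions` is a hypothetical helper.

```python
def group_mentions(jsondata):
    """Collect coreference chains from CoreNLP-style JSON into simple records."""
    people = []
    for chain_id, mentions in jsondata.get("corefs", {}).items():
        if not mentions:
            continue
        # Use the representative mention as the name, else fall back to the first mention.
        rep = next((m for m in mentions if m.get("isRepresentativeMention")), mentions[0])
        people.append({
            "chain": chain_id,
            "name": rep["text"],
            "mentions": [
                # Assumes endIndex is exclusive, so the token list is start..end-1.
                (m["text"], m["sentNum"], list(range(m["startIndex"], m["endIndex"])))
                for m in mentions
            ],
        })
    return people
```

Inside GetEntities this could be called as `group_mentions(jsondata)` after the JSON file is loaded; edges between entities would still need separate work on the dependency output.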
