How to reconstruct text entities with Hugging Face's transformers pipelines without IOB tags?


Problem description


I've been looking to use Hugging Face's Pipelines for NER (named entity recognition). However, it is returning the entity labels in inside-outside-beginning (IOB) format but without the IOB labels. So I'm not able to map the output of the pipeline back to my original text. Moreover, the outputs are masked in BERT tokenization format (the default model is BERT-large).

For example:

from transformers import pipeline
nlp_bert_lg = pipeline('ner')
print(nlp_bert_lg('Hugging Face is a French company based in New York.'))

The output is:

[{'word': 'Hu', 'score': 0.9968873858451843, 'entity': 'I-ORG'},
{'word': '##gging', 'score': 0.9329522848129272, 'entity': 'I-ORG'},
{'word': 'Face', 'score': 0.9781811237335205, 'entity': 'I-ORG'},
{'word': 'French', 'score': 0.9981815814971924, 'entity': 'I-MISC'},
{'word': 'New', 'score': 0.9987512826919556, 'entity': 'I-LOC'},
{'word': 'York', 'score': 0.9976728558540344, 'entity': 'I-LOC'}]


As you can see, New York is broken up into two tags.


How can I map Hugging Face's NER Pipeline back to my original text?


Transformers version: 2.7

Recommended answer


Unfortunately, as of now (version 2.6, and I think even with 2.7), you cannot do that with the pipeline feature alone. Since the __call__ function invoked by the pipeline just returns a list (see the code here), you'd have to do a second tokenization step with an "external" tokenizer, which defeats the purpose of the pipelines altogether.


But, instead, you can make use of the second example posted in the documentation, just below the sample similar to yours. For the sake of future completeness, here is the code:

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

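# Note: this checkpoint is a fine-tuned bert-large-cased, which shares its
# cased vocabulary with bert-base-cased, so the tokenizer loaded below still
# produces matching token ids.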
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

label_list = [
    "O",       # Outside of a named entity
    "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
    "I-MISC",  # Miscellaneous entity
    "B-PER",   # Beginning of a person's name right after another person's name
    "I-PER",   # Person's name
    "B-ORG",   # Beginning of an organisation right after another organisation
    "I-ORG",   # Organisation
    "B-LOC",   # Beginning of a location right after another location
    "I-LOC"    # Location
]

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very " \
           "close to the Manhattan Bridge."

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")

outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])
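For reference, the printed list will look something like the following (scores are not shown; the exact word pieces depend on the tokenizer, and the special tokens [CLS] and [SEP] come out labelled O):

[('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'), ('.', 'O'), ..., ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ..., ('[SEP]', 'O')]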


This returns exactly what you are looking for. Note that the CoNLL annotation scheme describes the following in its original paper:


Each line contains four fields: the word, its part-of-speech tag, its chunk tag and its named entity tag. Words tagged with O are outside of named entities and the I-XXX tag is used for words inside a named entity of type XXX. Whenever two entities of type XXX are immediately next to each other, the first word of the second entity will be tagged B-XXX in order to show that it starts another entity. The data contains entities of four types: persons (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC). This tagging scheme is the IOB scheme originally put forward by Ramshaw and Marcus (1995).


Meaning, if you are unhappy with the (still split) entities, you can concatenate all subsequent I- tagged tokens, or a B- tag followed by I- tags, into one entity; a small sketch of that merging step follows below. Under this scheme, it is not possible for two different (immediately neighboring) entities to both be tagged with only I- tags.
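If you want that merging as code, here is a minimal sketch. The group_entities helper is hypothetical, written for this answer rather than part of transformers, and it reuses the tokens, predictions and label_list variables from the snippet above:

def group_entities(tagged_tokens):
    """Collapse "##" word pieces and consecutive I-/B- tags into (text, type) spans."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        if current_tokens:
            # Re-attach "##" word pieces to the token that precedes them.
            entities.append((" ".join(current_tokens).replace(" ##", ""), current_type))

    for token, label in tagged_tokens:
        if label == "O" or token in ("[CLS]", "[SEP]"):
            flush()
            current_tokens, current_type = [], None
            continue
        prefix, entity_type = label.split("-", 1)
        if prefix == "B" or entity_type != current_type:
            flush()  # a B- tag or a change of entity type starts a new entity
            current_tokens, current_type = [token], entity_type
        else:
            current_tokens.append(token)
    flush()
    return entities

tagged = [(token, label_list[p]) for token, p in zip(tokens, predictions[0].tolist())]
print(group_entities(tagged))
# Expected to look something like:
# [('Hugging Face Inc', 'ORG'), ('New York City', 'LOC'), ('DUMBO', 'LOC'), ('Manhattan Bridge', 'LOC')]

Note that this simple join cannot restore characters the tokenizer discarded (e.g. it yields "Hugging Face Inc" without the period, since "." is tagged O); for exact character offsets into the original string you would still need the tokenizer's alignment information.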

