How to force a POS tag in spaCy before/after the tagger?


Question

If I process this sentence

'Return target card to your hand'

with spaCy and the en_core_web_lg model, it recognizes the tokens as below:

Return NOUN target NOUN card NOUN to ADP your ADJ hand NOUN

How can I force 'Return' to be tagged as a VERB? And how can I do it before the parser runs, so that the parser can better interpret the relations between tokens?

There are other situations in which this would be useful. I am dealing with text that contains specific symbols such as {G}. These three characters should be treated as a single NOUN, and {T} should be a VERB. Right now I do not know how to achieve that without developing a new model for tokenizing and tagging. If I could "force" a tag, I could replace these symbols with something that would be recognized as one token and force it to be tagged appropriately. For example, I could replace {G} with SYMBOLG and force SYMBOLG to be tagged as NOUN.

Recommended answer

This solution used spaCy 2.0.12 (IIRC).

To answer the second part of your question, you can add special tokenisation rules to the tokeniser, as described in the spaCy docs. The following code should do what you want, assuming those symbols are unambiguous:

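import spacy
from spacy.symbols import ORTH, POS, NOUN, VERB

# 'en' is the spaCy 2.x shortcut name for the default English model
nlp = spacy.load('en')

# Treat each symbol as a single token with a fixed part of speech
nlp.tokenizer.add_special_case('{G}', [{ORTH: '{G}', POS: NOUN}])
nlp.tokenizer.add_special_case('{T}', [{ORTH: '{T}', POS: VERB}])

doc = nlp('This {G} a noun and this is a {T}')

# Print each token next to its coarse-grained POS tag
for token in doc:
    print('{:10}{:10}'.format(token.text, token.pos_))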

The output is as follows (the tags are not all correct, but this shows that the special case rules have been applied):

This      DET       
{G}       NOUN      
a         DET       
noun      NOUN      
and       CCONJ     
this      DET       
is        VERB      
a         DET       
{T}       VERB      
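
The placeholder idea from the question (replacing {G} with SYMBOLG) can also be done as plain preprocessing before the text reaches spaCy. The following is only a minimal sketch of that step: the names SYMBOL_MAP, replace_symbols and FORCED_POS are hypothetical, SYMBOLT is an assumed placeholder for {T}, and the code does not call spaCy at all. It just produces the rewritten text and a lookup of the tags one would want to force afterwards.

```python
import re

# Hypothetical mapping: symbol -> (placeholder word, POS to force later).
SYMBOL_MAP = {
    "{G}": ("SYMBOLG", "NOUN"),
    "{T}": ("SYMBOLT", "VERB"),
}

# One regex that matches any of the known symbols literally.
SYMBOL_RE = re.compile("|".join(re.escape(s) for s in SYMBOL_MAP))


def replace_symbols(text):
    """Return text with each known symbol swapped for its placeholder."""
    return SYMBOL_RE.sub(lambda m: SYMBOL_MAP[m.group(0)][0], text)


# Placeholder -> POS lookup, usable for overriding tags after tagging.
FORCED_POS = {placeholder: pos for placeholder, pos in SYMBOL_MAP.values()}

print(replace_symbols("{T}: add {G} to your pool"))
```

Compared with the special case rules above, this keeps the tokenizer untouched, at the cost of changing the text (and therefore the character offsets) that the pipeline sees.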

As for the first part of your question, the problem with assigning a part of speech to individual words is that most of them are ambiguous out of context (e.g. is "return" a noun or a verb?). The method above therefore cannot account for how a word is used in context and is likely to generate errors. spaCy does allow token-based pattern matching, however, so that is worth a look. There may be a way to do what you're after.
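
To make the point about context concrete, here is a small sketch of a context-dependent override. It is plain Python operating on (text, POS) pairs such as those read from token.text and token.pos_, not spaCy's own pattern matcher, and the rule itself is an assumption for illustration: a sentence-initial word from a hypothetical list of ambiguous verbs is retagged as VERB only when it was tagged NOUN and is directly followed by something nominal.

```python
def force_imperative_verb(tagged, verbs=("return",)):
    """Retag a sentence-initial word as VERB when a nominal follows it.

    tagged: list of (text, pos) pairs for one sentence.
    verbs: hypothetical list of words that are ambiguous noun/verb.
    """
    if not tagged:
        return []
    result = list(tagged)
    first_text, first_pos = result[0]
    # Context check: the next token must look like the start of a noun phrase.
    followed_by_nominal = (
        len(result) > 1 and result[1][1] in ("NOUN", "DET", "ADJ", "PROPN")
    )
    if first_text.lower() in verbs and first_pos == "NOUN" and followed_by_nominal:
        result[0] = (first_text, "VERB")
    return result


tagged = [("Return", "NOUN"), ("target", "NOUN"), ("card", "NOUN"),
          ("to", "ADP"), ("your", "ADJ"), ("hand", "NOUN")]
print(force_imperative_verb(tagged))
```

A sentence such as "The return was late" is left alone by this rule, because "return" is not sentence-initial there. Note that an override applied to finished tags like this does not change what the parser saw, so it addresses only the "after the tagger" half of the question.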
