Spacy - Tokenize quoted string


Question

I am using spacy 2.0 and using a quoted string as input.

Example string:

"The quoted text 'AA XX' should be tokenized"

and I want to extract:

[The, quoted, text, 'AA XX', should, be, tokenized]

However, I get some strange results while experimenting: the noun chunks and ents each lose one of the quotes.

import spacy
nlp = spacy.load('en')
s = "The quoted text 'AA XX' should be tokenized"
doc = nlp(s)
print([t for t in doc])
print([t for t in doc.noun_chunks])
print([t for t in doc.ents])

Result:

[The, quoted, text, ', AA, XX, ', should, be, tokenized]
[The quoted text 'AA XX]
[AA XX']

What is the best way to address what I need?

Answer

While you could modify the tokenizer and add your own custom prefix, suffix and infix rules that exclude quotes, I'm not sure this is the best solution here.

For your use case, it might make more sense to add a component to your pipeline that merges (certain) quoted strings into one token before the tagger, parser and entity recognizer are called. To accomplish this, you can use the rule-based Matcher and find combinations of tokens surrounded by '. The following pattern looks for one or more alphanumeric characters:

pattern = [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}]

Here's a visual example of the pattern in the interactive matcher demo. To do the merging, you can then set up the Matcher, add the pattern and write a function that takes a Doc object, extracts the matched spans and merges them into one token by calling their .merge method.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
matcher.add('QUOTED', None, [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}])

def quote_merger(doc):
    # this will be called on the Doc object in the pipeline
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:  # merge into one token after collecting all matches
        span.merge()
    return doc

nlp.add_pipe(quote_merger, first=True)  # add it right after the tokenizer
doc = nlp("The quoted text 'AA XX' should be tokenized")
print([token.text for token in doc])
# ['The', 'quoted', 'text', "'AA XX'", 'should', 'be', 'tokenized']

For a more elegant solution, you can also refactor the component as a reusable class that sets up the matcher in its __init__ method (see the docs for examples).

If you add the component first in the pipeline, all other components like the tagger, parser and entity recognizer will only get to see the retokenized Doc. That's also why you might want to write more specific patterns that only merge certain quoted strings you care about. In your example, the new token boundaries improve the predictions – but I can also think of many other cases where they don't, especially if the quoted string is longer and contains a significant part of the sentence.
