Spacy - Tokenize quoted string
Problem Description
I am using spacy 2.0 and using a quoted string as input.
Example string:
"The quoted text 'AA XX' should be tokenized"
and want to extract:
[The, quoted, text, 'AA XX', should, be, tokenized]
However, I get some strange results while experimenting. Noun chunks and ents lose one of the quotes.
import spacy
nlp = spacy.load('en')
s = "The quoted text 'AA XX' should be tokenized"
doc = nlp(s)
print([t for t in doc])
print([t for t in doc.noun_chunks])
print([t for t in doc.ents])
Result:
[The, quoted, text, ', AA, XX, ', should, be, tokenized]
[The quoted text 'AA XX]
[AA XX']
What is the best way to get what I need?
Answer
While you could modify the tokenizer and add your own custom prefix, suffix and infix rules that exclude quotes, I'm not sure this is the best solution here.
For your use case, it might make more sense to add a component to your pipeline that merges (certain) quoted strings into one token before the tagger, parser and entity recognizer are called. To accomplish this, you can use the rule-based Matcher and find combinations of tokens surrounded by '. The following pattern looks for one or more alphabetic tokens enclosed in single quotes:
pattern = [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}]
Here's a visual example of the pattern in the interactive matcher demo. To do the merging, you can then set up the Matcher, add the pattern and write a function that takes a Doc object, extracts the matched spans and merges them into one token by calling their .merge method.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
matcher.add('QUOTED', None, [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}])
def quote_merger(doc):
    # this will be called on the Doc object in the pipeline
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:  # merge into one token after collecting all matches
        span.merge()
    return doc
nlp.add_pipe(quote_merger, first=True) # add it right after the tokenizer
doc = nlp("The quoted text 'AA XX' should be tokenized")
print([token.text for token in doc])
# ['The', 'quoted', 'text', "'AA XX'", 'should', 'be', 'tokenized']
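Note that newer spaCy versions (2.1+) deprecate Span.merge in favour of the Doc.retokenize context manager. A minimal sketch of the same component written against that API, assuming you are free to upgrade:
def quote_merger(doc):
    # same merging logic, using the retokenizer (spaCy 2.1+)
    with doc.retokenize() as retokenizer:
        for match_id, start, end in matcher(doc):
            retokenizer.merge(doc[start:end])
    return doc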
For a more elegant solution, you can also refactor the component as a reusable class that sets up the matcher in its __init__ method (see the docs for examples).
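For example, such a class might look like the following (a minimal sketch; the name QuoteMerger is illustrative, not part of spaCy's API):
class QuoteMerger(object):
    # pipeline component that merges quoted strings into single tokens
    def __init__(self, nlp):
        # set up the matcher once, when the component is created
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add('QUOTED', None, [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}])
    def __call__(self, doc):
        # collect all spans first, then merge, so token indices stay valid
        spans = [doc[start:end] for match_id, start, end in self.matcher(doc)]
        for span in spans:
            span.merge()
        return doc

nlp.add_pipe(QuoteMerger(nlp), first=True)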
If you add the component first in the pipeline, all other components like the tagger, parser and entity recognizer will only get to see the retokenized Doc. That's also why you might want to write more specific patterns that only merge certain quoted strings you care about. In your example, the new token boundaries improve the predictions, but I can also think of many other cases where they don't, especially if the quoted string is longer and contains a significant part of the sentence.
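For instance, if you only care about quoted strings made up of uppercase tokens like 'AA XX' (an assumption about your data), you could tighten the pattern accordingly:
# illustrative stricter pattern: only uppercase tokens between the quotes
pattern = [{'ORTH': "'"}, {'IS_UPPER': True, 'OP': '+'}, {'ORTH': "'"}]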