Spacy tokenizer, add tokenizer exception
Problem description
Hey! I am trying to add an exception when tokenizing some tokens using spaCy 2.0.2. I know that .tokenizer.add_special_case() exists, and I am using it for some cases, but for a token like US$100, spaCy splits it into two tokens:
('US$', 'SYM'), ('100', 'NUM')
But I want it split into three tokens, like below. Instead of adding a special case for each number that can follow US$, I want to make an exception for every token of the form US$NUMBER.
('US', 'PROPN'), ('$', 'SYM'), ('800', 'NUM')
I was reading about TOKENIZER_EXCEPTIONS in the spaCy documentation, but I can't figure out how to do this.
I was trying to use from spacy.lang.en.tokenizer_exceptions import TOKENIZER_EXCEPTIONS, and also spacy.util, which has a method update_exc().
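For reference, this is how add_special_case is typically used; it only matches the exact string given, which is why it cannot cover every US$NUMBER variant (the "US$100" rule below is just an illustration):

```python
import spacy
from spacy.attrs import ORTH

nlp = spacy.blank("en")
# A special case matches one exact whitespace-delimited string, so this
# rule only covers the literal "US$100", not "US$800" etc.
nlp.tokenizer.add_special_case("US$100", [{ORTH: "US"}, {ORTH: "$"}, {ORTH: "100"}])
print([t.text for t in nlp("US$100")])  # ['US', '$', '100']
```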
Can someone post a full code example of how to do it?
Oh, one more thing: I know that the file tokenizer_exceptions in lang.en already contains some exceptions, for example splitting "i'm" into "i" and "'m". I already commented that part out, but that doesn't work. I don't want the tokenizer to split "i'm"; how can I do that as well?
Thanks
Recommended answer
Here is the solution:
import spacy
from spacy import util
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

def custom_en_tokenizer(en_vocab):
    prefixes = list(English.Defaults.prefixes)
    prefixes.remove(r'US\$')             # Remove the built-in "US$" currency prefix
    prefixes.append(r'(?:US)(?=\$\d+)')  # New rule: split off "US" when "$<digits>" follows
    prefix_re = util.compile_prefix_regex(tuple(prefixes))
    suffix_re = util.compile_suffix_regex(English.Defaults.suffixes)
    infix_re = util.compile_infix_regex(English.Defaults.infixes)
    return Tokenizer(en_vocab,
                     English.Defaults.tokenizer_exceptions,
                     prefix_re.search,
                     suffix_re.search,
                     infix_re.finditer,
                     token_match=None)
tokenizer = custom_en_tokenizer(spacy.blank('en').vocab)
for token in tokenizer('US$100'):
    print(token, end=' ')  # prints: US $ 100
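As for the second question (keeping "i'm" as one token): rather than editing the installed tokenizer_exceptions file, you can copy the default exception table, drop the unwanted entries, and pass the result when building the Tokenizer. A minimal sketch; the exact keys are assumptions, so check which casing and apostrophe variants your spaCy version's table actually contains:

```python
from spacy.lang.en import English

# Copy the stock English exception table and drop the entries that split
# "i'm" into "i" + "'m". The keys below are assumptions; inspect the table
# in your installed version to see which variants exist.
exceptions = dict(English.Defaults.tokenizer_exceptions)
for key in ("i'm", "I'm", "i’m", "I’m"):
    exceptions.pop(key, None)

# Pass `exceptions` in place of English.Defaults.tokenizer_exceptions when
# constructing the Tokenizer, as in custom_en_tokenizer above.
```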