Spacy tokenizer, add tokenizer exception


Problem Description

Hey! I am trying to add an exception when tokenizing certain tokens with spaCy 2.0.2. I know that .tokenizer.add_special_case() exists, and I am using it for some cases, but for a token like US$100, spaCy splits it into two tokens:

('US$', 'SYM'), ('100', 'NUM')
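For context, the special-case API covers exact strings only, which is why it does not scale to every US$NUMBER token. A minimal sketch of that per-string approach (the example text is only an illustration):

import spacy
from spacy.symbols import ORTH

nlp = spacy.blank('en')
# Register a literal special case: the exact string "US$100" becomes three tokens.
# The ORTH values must concatenate back to the original string.
nlp.tokenizer.add_special_case('US$100', [{ORTH: 'US'}, {ORTH: '$'}, {ORTH: '100'}])
print([t.text for t in nlp('It costs US$100.')])
# expected: ['It', 'costs', 'US', '$', '100', '.']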

But I want it split into three tokens like this. Instead of adding a special case for each possible number after the US$, I want an exception for every token of the form US$NUMBER:

('US', 'PROPN'), ('$', 'SYM'), ('800', 'NUM')

I was reading about TOKENIZER_EXCEPTIONS in the spaCy documentation, but I can't figure out how to do this.

I am trying to use from spacy.lang.en.tokenizer_exceptions import TOKENIZER_EXCEPTIONS, and also spacy.util, which has a method update_exc().
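For reference, this is roughly how those two pieces fit together. Note that tokenizer exceptions are keyed on literal strings, so they cannot express a pattern like US$NUMBER; the 'US$100' entry below is only an illustration:

from spacy.lang.en.tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from spacy.symbols import ORTH
from spacy.util import update_exc

# Merge an extra literal exception into the stock English exceptions.
# Each entry maps an exact string to the tokens it should produce.
extra_exceptions = {'US$100': [{ORTH: 'US'}, {ORTH: '$'}, {ORTH: '100'}]}
all_exceptions = update_exc(TOKENIZER_EXCEPTIONS, extra_exceptions)

# The merged dict can then be passed as the rules argument when
# constructing a custom Tokenizer (see the answer below).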

Can someone post a full code example of how to do it?

Oh, one more thing: I know that the file tokenizer_exceptions in lang.en already contains some exceptions, for example splitting "i'm" into "i" and "'m". I already commented out that part, but that doesn't work. I don't want the tokenizer to split "i'm"; how can I do that as well?
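A possible way to do this, assuming spaCy 2.x, is to override the built-in rule with a new special case for the same string instead of editing the library file; a minimal sketch:

import spacy
from spacy.symbols import ORTH

nlp = spacy.blank('en')
# Override the built-in exception for the exact string "i'm" with a single-token rule.
# Note this is case-sensitive: "I'm" has its own separate entry.
nlp.tokenizer.add_special_case("i'm", [{ORTH: "i'm"}])
print([t.text for t in nlp("i'm here")])
# expected: ["i'm", 'here']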

Thanks

Recommended Answer

The solution is here:

import spacy
from spacy import util
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer


def custom_en_tokenizer(en_vocab):
    prefixes = list(English.Defaults.prefixes)
    prefixes.remove(r'US\$')             # Remove the built-in "US$" currency prefix
    prefixes.append(r'(?:US)(?=\$\d+)')  # Append a prefix rule matching "US" when followed by "$<digits>"

    prefix_re = util.compile_prefix_regex(tuple(prefixes))
    suffix_re = util.compile_suffix_regex(English.Defaults.suffixes)
    infix_re = util.compile_infix_regex(English.Defaults.infixes)

    return Tokenizer(en_vocab,
                     English.Defaults.tokenizer_exceptions,
                     prefix_re.search,
                     suffix_re.search,
                     infix_re.finditer,
                     token_match=None)

tokenizer = custom_en_tokenizer(spacy.blank('en').vocab)
for token in tokenizer('US$100'):
    print(token, end=' ')
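To use the custom tokenizer inside a regular pipeline rather than calling it standalone, one option (assuming spaCy 2.x) is to assign it to nlp.tokenizer, roughly like this:

nlp = spacy.blank('en')
nlp.tokenizer = custom_en_tokenizer(nlp.vocab)
print([t.text for t in nlp('US$100')])
# expected: ['US', '$', '100']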

