Spacy tokenizer with only "Whitespace" rule


Problem Description

I would like to know if the spacy tokenizer could tokenize words only using the "space" rule. For example:

sentence = "(c/o Oxford University )"

Normally, using the following configuration of spacy:

nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
for token in doc:
   print(token)

The result is:

 (
 c
 /
 o
 Oxford
 University
 )

Instead, I would like an output like the following (using spacy):

(c/o 
Oxford 
University
)

Is it possible to obtain a result like this using spacy?

Recommended Answer

Let's replace nlp.tokenizer with a custom Tokenizer that uses a token_match regex:

import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('en_core_web_sm')
text = "This is it's"
print("Before:", [tok for tok in nlp(text)])

nlp.tokenizer = Tokenizer(nlp.vocab, token_match=re.compile(r'\S+').match)
print("After :", [tok for tok in nlp(text)])


Before: [This, is, it, 's]
After : [This, is, it's]
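
For completeness, here is the same whitespace-only tokenizer applied to the sentence from the question (a small sketch; the output shown is what the \S+ rule should produce):

import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('en_core_web_sm')
# Keep every whitespace-separated chunk as a single token
nlp.tokenizer = Tokenizer(nlp.vocab, token_match=re.compile(r'\S+').match)

sentence = "(c/o Oxford University )"
print([tok.text for tok in nlp(sentence)])
# expected: ['(c/o', 'Oxford', 'University', ')']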

You can further adjust Tokenizer by adding custom suffix, prefix, and infix rules.
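
For instance, here is a minimal sketch that keeps whitespace splitting but also strips an opening "(" as a prefix and a closing ")" as a suffix (the parenthesis-only patterns are illustrative assumptions, not part of the original answer):

import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('en_core_web_sm')

# Treat only a leading "(" or "[" as a prefix and a trailing ")" or "]" as a suffix;
# everything else stays one whitespace-separated token.
prefix_re = re.compile(r'^[\(\[]')
suffix_re = re.compile(r'[\)\]]$')

nlp.tokenizer = Tokenizer(nlp.vocab,
                          prefix_search=prefix_re.search,
                          suffix_search=suffix_re.search)

print([tok.text for tok in nlp("(c/o Oxford University )")])
# expected: ['(', 'c/o', 'Oxford', 'University', ')']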

An alternative, more fine-grained approach is to find out why the it's token is split the way it is, using nlp.tokenizer.explain():

import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en_core_web_sm')
text = "This is it's. I'm fine"
nlp.tokenizer.explain(text)


You will see that the split is due to the SPECIAL rules:

[('TOKEN', 'This'),
 ('TOKEN', 'is'),
 ('SPECIAL-1', 'it'),
 ('SPECIAL-2', "'s"),
 ('SUFFIX', '.'),
 ('SPECIAL-1', 'I'),
 ('SPECIAL-2', "'m"),
 ('TOKEN', 'fine')]

These can be updated to remove "it's" from the exceptions, like so:

exceptions = nlp.Defaults.tokenizer_exceptions
filtered_exceptions = {k:v for k,v in exceptions.items() if k!="it's"}
nlp.tokenizer = Tokenizer(nlp.vocab, rules = filtered_exceptions)
[tok for tok in nlp(text)]


[This, is, it's., I, 'm, fine]

or to remove the split on apostrophes altogether:

filtered_exceptions = {k:v for k,v in exceptions.items() if "'" not in k}
nlp.tokenizer = Tokenizer(nlp.vocab, rules = filtered_exceptions)
[tok for tok in nlp(text)]


[This, is, it's., I'm, fine]

Note the dot attached to the token, which is due to no suffix rules being specified.
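
If that trailing dot should be split off as well, one option (a sketch continuing the snippet above; the punctuation pattern is an illustrative assumption rather than spaCy's full default suffix rules) is to pass a suffix_search to the custom Tokenizer:

import re
from spacy.tokenizer import Tokenizer

# Split trailing punctuation off as a suffix while keeping the filtered
# exceptions (no apostrophe special cases) from the previous snippet.
suffix_re = re.compile(r'[.,!?;:]+$')
nlp.tokenizer = Tokenizer(nlp.vocab,
                          rules=filtered_exceptions,
                          suffix_search=suffix_re.search)
print([tok.text for tok in nlp(text)])
# expected: ['This', 'is', "it's", '.', "I'm", 'fine']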
