Spacy custom tokenizer to include only hyphen words as tokens using Infix regex

Problem description

I want to include hyphenated words, for example long-term, self-esteem, etc., as a single token in spaCy. After looking at some similar posts on Stack Overflow, GitHub, its documentation and elsewhere, I also wrote a custom tokenizer as below:

import re
import spacy
from spacy.tokenizer import Tokenizer

# strip one opening bracket, parenthesis or quote from the token start
prefix_re = re.compile(r'''^[\[\("']''')
# strip one closing bracket, parenthesis or quote from the token end
suffix_re = re.compile(r'''[\]\)"']$''')
# split tokens internally on these punctuation characters
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en_core_web_lg')
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u'Note: Since the fourteenth century the practice of "medicine" has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]

So for this sentence: 'Note: Since the fourteenth century the practice of "medicine" has become a profession; and more importantly, it\'s a male-dominated profession.'

Now, the tokens after incorporating the custom spaCy tokenizer are:

注",:",自","the",第十四",世纪","the",实践","of",""药物",'" ",具有",;",成为","a",专业",,",和",更多",重要",, ',是",,"a","男性主导",专业",."

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '"medicine', '"', 'has', ';', 'become', 'a', 'profession', ',', 'and', 'more', 'importantly', ',', "it's", 'a', 'male-dominated', 'profession', '.'

Earlier, before this change, the tokens were:

注",:",自","the",第十四",世纪","the",实践","of","" ", "药物","" ,有",成为",一个",专业",;",和",更多", 重要",,","","","a","男性"," -","主导",专业",."

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '"', 'medicine', '"', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male', '-', 'dominated', 'profession', '.'

And the expected tokens should be:

注",:",自","the",第十四",世纪","the",实践","of","" ", "药物","" ,有",成为",一个",专业",;",和",更多", 重要",,","","'s ","a","男性主导",专业','.'

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '"', 'medicine', '"', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.'

As one can see, the hyphenated word is now kept as a single token, and the other punctuation marks are split off as intended, except for the double quotes and the apostrophe: those no longer show the earlier (and expected) behaviour. I have tried different permutations and combinations of the infix regex, but made no progress in fixing this issue. Hence, any help would be highly appreciated.

Recommended answer

Using the default prefix_re and suffix_re gives me the expected output. spaCy strips prefixes and suffixes off a token before the infix rules are applied, and the default prefix/suffix patterns cover the quote characters and the 's contraction that the simple custom patterns in the question miss:

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    # same custom infix pattern as in the question
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    # but keep spaCy's default prefix and suffix rules
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u'Note: Since the fourteenth century the practice of "medicine" has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]

[注意",:",自",该",第十四",世纪",该",实践",的",",药物"," ','有','成为','a','专业',';','和','更多','重要',',','它','s','a' ,男性主导",专业",."]

['Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '"', 'medicine', '"', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.']
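
As a quick sanity check (not part of the original answer), one can reuse the nlp object from the block above and assert that hyphenated words survive as single tokens:

doc = nlp(u'It is a long-term, male-dominated profession.')
tokens = [token.text for token in doc]
# the custom infix regex no longer splits on internal hyphens
assert 'long-term' in tokens
assert 'male-dominated' in tokens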

If you want to dig into why your regexes weren't working like spaCy's, here are links to the relevant source code:

Prefixes and suffixes are defined here:

https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py

With reference to the characters (e.g., quotes, hyphens, etc.) defined here:

https://github.com/explosion/spaCy/blob/master/spacy/lang/char_classes.py

And the functions used to compile them (e.g., compile_prefix_regex):

https://github.com/explosion/spaCy/blob/master/spacy/util.py
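
Building on those defaults, a common alternative (shown here as a sketch assuming the spaCy 2.1+ character classes, not part of the original answer) is to reuse spaCy's own infix list and simply drop the single rule that splits on hyphens between letters. That keeps hyphenated words intact while leaving all of the default quote and apostrophe handling untouched:

import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

nlp = spacy.load('en_core_web_lg')

# spaCy's default infix patterns from lang/punctuation.py, with the
# hyphen rule (?<=[{a}])(?:{h})(?=[{a}]) left out, so that words like
# 'male-dominated' are no longer split on the internal hyphen
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

Because this only removes one rule instead of replacing the infix regex wholesale, everything except hyphen splitting behaves exactly as in stock spaCy.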
