Is it possible to change the token split rules for a Spacy tokenizer?


Question

The (German) spacy tokenizer does not split on slashes, underscores, or asterisks by default, which is just what I need (so "der/die" results in a single token).

However, it does split on parentheses, so "dies(und)das" gets split into 5 tokens. Is there a (simple) way to tell the default tokenizer to also not split on parentheses that are enclosed by letters on both sides, with no space?
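
For reference, a minimal repro of the behaviour described above (assuming a spaCy install with the German model available as 'de'):

import spacy

nlp = spacy.load('de')

print([t.text for t in nlp("der/die")])       # one token: ['der/die']
print([t.text for t in nlp("dies(und)das")])  # five tokens: ['dies', '(', 'und', ')', 'das']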

How exactly are those splits on parentheses defined for a tokenizer?

Answer

The split on parentheses is defined in this line, where the tokenizer splits on a parenthesis between two letters:

https://github.com/explosion/spaCy/blob/23ec07debdd568f09c7c83b10564850f9fa67ad4/spacy/lang/de/punctuation.py#L18
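
The pattern there is roughly of this form (quoted from memory rather than copied from the file, so treat it as an approximation):

r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes)

That is, a quote, bracket, or parenthesis is treated as an infix (and therefore split into its own token) whenever it has a letter immediately on both sides.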

There's no simple way to remove individual infix patterns, but you can define a custom tokenizer that does what you want. One way is to copy the infix definition from spacy/lang/de/punctuation.py and modify it:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, LIST_ELLIPSES, LIST_ICONS
from spacy.lang.de.punctuation import _quotes
from spacy.util import compile_infix_regex

def custom_tokenizer(nlp):
    # Copy of the default German infix patterns, with the parentheses
    # removed from the bracket/quote pattern below.
    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
            r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
            r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
            r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
            # quotes and square brackets between letters still split,
            # but "(" and ")" no longer do
            r"(?<=[{a}])([{q}\]\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
            r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
            r"(?<=[0-9])-(?=[0-9])",
        ]
    )

    infix_re = compile_infix_regex(infixes)

    # Keep the default prefix/suffix rules, token_match, and exceptions;
    # only the infix patterns are replaced.
    return Tokenizer(nlp.vocab,
                     prefix_search=nlp.tokenizer.prefix_search,
                     suffix_search=nlp.tokenizer.suffix_search,
                     infix_finditer=infix_re.finditer,
                     token_match=nlp.tokenizer.token_match,
                     rules=nlp.Defaults.tokenizer_exceptions)


nlp = spacy.load('de')
nlp.tokenizer = custom_tokenizer(nlp)
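
With the custom tokenizer in place, the example from the question should come through as a single token (a quick check; the exact output may vary slightly between spaCy versions):

print([t.text for t in nlp("dies(und)das")])
# expected: ['dies(und)das']

print([t.text for t in nlp("dies (und) das")])
# parentheses next to whitespace are still handled by the prefix/suffix rules:
# expected: ['dies', '(', 'und', ')', 'das']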
