SpaCy -- intra-word hyphens. How to treat them as one word?


Problem description

The following is produced as the result:

$python3 /tmp/nlp.py  
['Marketing-Representative-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']  
['Out-of-box', 'implementation']  

What are the first (r"[./]") and the last (r"(.'.)") patterns used for in the following?

infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")

I want spaCy to treat intra-hyphen words as one token, without negatively impacting other split rules.

"That is Yahya's laptop-cover. 3.14!"

["That","is","Yahya","s","laptop-cover",.","3.14",!"]( 期望 )

["That", "is", "Yahya", "'s", "laptop-cover", ".", "3.14", "!"] (EXPECTED)

By default,

import spacy
nlp = spacy.load('en_core_web_md')
for token in nlp("That is Yahya's laptop-cover. 3.14!"):
    print (token.text)

spaCy gives:

["That", "is", "Yahya", "'s", "laptop", "-", "cover", ".", "3.14", "!"]

However,

from spacy.util import compile_infix_regex

infixes = nlp.Defaults.prefixes + (r"[-]~",)
infix_re = compile_infix_regex(infixes)
nlp.tokenizer = spacy.tokenizer.Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
for token in nlp("That is Yahya's laptop-cover. 3.14!"):
    print(token.text)

gives:

["That", "is", "Yahya", "'", "s", "laptop-cover.", "3.14", "!"]

Answer

NOTE: To see the custom tokenizer that keeps hyphenated words as single tokens, see the bottom of the answer.

Here, a custom tokenizer is defined that tokenizes text into tokens using a set of built-in (nlp.Defaults.prefixes) and custom ([./], [-]~, (.'.)) patterns.

The nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)") is a tuple concatenation operation; the result looks like

('§', '%', '=', '—', '–', '\\+(?![0-9])', '…', '……', ',', ':', ';', '\\!', '\\?', '¿', '؟', '¡', '\\(', '\\)', '\\[', '\\]', '\\{', '\\}', '<', '>', '_', '#', '\\*', '&', '。', '?', '!', ',', '、', ';', ':', '~', '·', '।', '،', '؛', '٪', '\\.\\.+', '…', "\\'", '"', '"', '"', '`', '‘', '´', '’', '‚', ',', '„', '»', '«', '「', '」', '『', '』', '(', ')', '〔', '〕', '【', '】', '《', '》', '〈', '〉', '\\$', '£', '€', '¥', '฿', 'US\\$', 'C\\$', 'A\\$', '₽', '﷼', '₴', '[\\u00A6\\u00A9\\u00AE\\u00B0\\u0482\\u058D\\u058E\\u060E\\u060F\\u06DE\\u06E9\\u06FD\\u06FE\\u07F6\\u09FA\\u0B70\\u0BF3-\\u0BF8\\u0BFA\\u0C7F\\u0D4F\\u0D79\\u0F01-\\u0F03\\u0F13\\u0F15-\\u0F17\\u0F1A-\\u0F1F\\u0F34\\u0F36\\u0F38\\u0FBE-\\u0FC5\\u0FC7-\\u0FCC\\u0FCE\\u0FCF\\u0FD5-\\u0FD8\\u109E\\u109F\\u1390-\\u1399\\u1940\\u19DE-\\u19FF\\u1B61-\\u1B6A\\u1B74-\\u1B7C\\u2100\\u2101\\u2103-\\u2106\\u2108\\u2109\\u2114\\u2116\\u2117\\u211E-\\u2123\\u2125\\u2127\\u2129\\u212E\\u213A\\u213B\\u214A\\u214C\\u214D\\u214F\\u218A\\u218B\\u2195-\\u2199\\u219C-\\u219F\\u21A1\\u21A2\\u21A4\\u21A5\\u21A7-\\u21AD\\u21AF-\\u21CD\\u21D0\\u21D1\\u21D3\\u21D5-\\u21F3\\u2300-\\u2307\\u230C-\\u231F\\u2322-\\u2328\\u232B-\\u237B\\u237D-\\u239A\\u23B4-\\u23DB\\u23E2-\\u2426\\u2440-\\u244A\\u249C-\\u24E9\\u2500-\\u25B6\\u25B8-\\u25C0\\u25C2-\\u25F7\\u2600-\\u266E\\u2670-\\u2767\\u2794-\\u27BF\\u2800-\\u28FF\\u2B00-\\u2B2F\\u2B45\\u2B46\\u2B4D-\\u2B73\\u2B76-\\u2B95\\u2B98-\\u2BC8\\u2BCA-\\u2BFE\\u2CE5-\\u2CEA\\u2E80-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u2FF0-\\u2FFB\\u3004\\u3012\\u3013\\u3020\\u3036\\u3037\\u303E\\u303F\\u3190\\u3191\\u3196-\\u319F\\u31C0-\\u31E3\\u3200-\\u321E\\u322A-\\u3247\\u3250\\u3260-\\u327F\\u328A-\\u32B0\\u32C0-\\u32FE\\u3300-\\u33FF\\u4DC0-\\u4DFF\\uA490-\\uA4C6\\uA828-\\uA82B\\uA836\\uA837\\uA839\\uAA77-\\uAA79\\uFDFD\\uFFE4\\uFFE8\\uFFED\\uFFEE\\uFFFC\\uFFFD\\U00010137-\\U0001013F\\U00010179-\\U00010189\\U0001018C-\\U0001018E\\U00010190-\\U0001019B\\U000101A0\\U000101D0-\\U000101FC\\U00010877\\U00010878\\U00010AC8\\U0001173F\\U00016B3C-\\U00016B3F\\U00016B45\\U0001BC9C\\U0001D000-\\U0001D0F5\\U0001D100-\\U0001D126\\U0001D129-\\U0001D164\\U0001D16A-\\U0001D16C\\U0001D183\\U0001D184\\U0001D18C-\\U0001D1A9\\U0001D1AE-\\U0001D1E8\\U0001D200-\\U0001D241\\U0001D245\\U0001D300-\\U0001D356\\U0001D800-\\U0001D9FF\\U0001DA37-\\U0001DA3A\\U0001DA6D-\\U0001DA74\\U0001DA76-\\U0001DA83\\U0001DA85\\U0001DA86\\U0001ECAC\\U0001F000-\\U0001F02B\\U0001F030-\\U0001F093\\U0001F0A0-\\U0001F0AE\\U0001F0B1-\\U0001F0BF\\U0001F0C1-\\U0001F0CF\\U0001F0D1-\\U0001F0F5\\U0001F110-\\U0001F16B\\U0001F170-\\U0001F1AC\\U0001F1E6-\\U0001F202\\U0001F210-\\U0001F23B\\U0001F240-\\U0001F248\\U0001F250\\U0001F251\\U0001F260-\\U0001F265\\U0001F300-\\U0001F3FA\\U0001F400-\\U0001F6D4\\U0001F6E0-\\U0001F6EC\\U0001F6F0-\\U0001F6F9\\U0001F700-\\U0001F773\\U0001F780-\\U0001F7D8\\U0001F800-\\U0001F80B\\U0001F810-\\U0001F847\\U0001F850-\\U0001F859\\U0001F860-\\U0001F887\\U0001F890-\\U0001F8AD\\U0001F900-\\U0001F90B\\U0001F910-\\U0001F93E\\U0001F940-\\U0001F970\\U0001F973-\\U0001F976\\U0001F97A\\U0001F97C-\\U0001F9A2\\U0001F9B0-\\U0001F9B9\\U0001F9C0-\\U0001F9C2\\U0001F9D0-\\U0001F9FF\\U0001FA60-\\U0001FA6D]', '[/.]', '-~', "(.'.)")

As you can see, these are all regular expressions, and they are used to process in-word punctuation, i.e. infixes. See the spaCy tokenizer algorithm:

The algorithm can be summarized as follows (a condensed sketch follows the list):

  1. Iterate over space-separated substrings.
  2. Check whether we have an explicitly defined rule for this substring. If we do, use it.
  3. Otherwise, try to consume a prefix.
  4. If we consumed a prefix, go back to the beginning of the loop, so that special cases always get priority.
  5. If we didn't consume a prefix, try to consume a suffix.
  6. If we can't consume a prefix or suffix, look for "infixes" — stuff like hyphens etc.
  7. Once we can't consume any more of the string, handle it as a single token.
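
For orientation, here is a condensed Python sketch of that loop, adapted from the pseudocode in the spaCy documentation. It is a simplification, not spaCy's actual implementation: it assumes prefix_re is anchored at the string start and suffix_re at the end (as spacy.util.compile_prefix_regex and compile_suffix_regex produce), and it omits caching and the token_match step for URLs.

def tokenize(text, special_cases, prefix_re, suffix_re, infix_re):
    tokens = []
    for substring in text.split():                    # 1. space-separated substrings
        suffixes = []
        while substring:
            if substring in special_cases:            # 2. an explicit rule wins
                tokens.extend(special_cases[substring])
                substring = ""
            elif prefix_re.search(substring):         # 3./4. strip a prefix, restart
                split = prefix_re.search(substring).end()
                tokens.append(substring[:split])
                substring = substring[split:]
            elif suffix_re.search(substring):         # 5. strip a suffix (emitted last)
                split = suffix_re.search(substring).start()
                suffixes.insert(0, substring[split:])
                substring = substring[:split]
            elif list(infix_re.finditer(substring)):  # 6. split on infix matches
                offset = 0
                for m in infix_re.finditer(substring):
                    if substring[offset:m.start()]:
                        tokens.append(substring[offset:m.start()])
                    tokens.append(m.group())
                    offset = m.end()
                if substring[offset:]:
                    tokens.append(substring[offset:])
                substring = ""
            else:                                     # 7. nothing left to split off
                tokens.append(substring)
                substring = ""
        tokens.extend(suffixes)
    return tokens

Step 6 is where the infix patterns discussed below come into play.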

Now, when we are at the infix handling step, these regular expressions are used to split the text into tokens wherever the patterns match.

E.g. [/.] is important because if you do not add it, abc.def/ghi will be a single token, but with the pattern added, it will be split into 'abc', '.', 'def', '/', 'ghi'.
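
You can see where that pattern would trigger splits with just the re module (a toy check, not spaCy itself):

import re

infix_re = re.compile(r"[/.]")
print([(m.group(), m.span()) for m in infix_re.finditer("abc.def/ghi")])
# => [('.', (3, 4)), ('/', (7, 8))] -- spaCy splits at these infix matches,
#    yielding 'abc', '.', 'def', '/', 'ghi'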

The [-]~ (which is the same as -~) matches a - and then expects a ~ right after; since the ~ is not there, the - is skipped, no split occurs, and you get the whole 'Marketing-Representative-' token. Note, however, that if the sentence contains 'Marketing-~Representative-' and you use the -~ regex, you will get ['Marketing', '-~', 'Representative-'] as a result, because there is a match.
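
The same kind of toy check (plain re, not spaCy) confirms that [-]~ only fires when a ~ actually follows the hyphen:

import re

infix_re = re.compile(r"[-]~")
print(infix_re.findall("Marketing-Representative-"))   # => []      -- no match, no split
print(infix_re.findall("Marketing-~Representative-"))  # => ['-~']  -- match, so a split occurs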

The .'. regex matches any char + ' + any char (a dot matches any character in regex). So the rule just tokenizes (splits out) these tokens from the sentence (e.g. n't, r'd, etc.).
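
And a toy check of what (.'.) captures (the contraction strings here are just illustrative examples):

import re

infix_re = re.compile(r"(.'.)")
print(infix_re.findall("won't they'd"))  # => ["n't", "y'd"]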

ANSWER

You should be very careful when adding new rules, and check that they do not overlap with rules that are already there.

E.g. when you add r"\b's\b" to split out the genitive apostrophe-s, you should "override" the "\\'" rule from nlp.Defaults.prefixes: either remove it if you do not plan to match ' as an infix, or give priority to your custom rules by appending nlp.Defaults.prefixes to those rules, not vice versa.

See the sample code:

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_md")
# Custom rules go first so they take priority over the default prefix patterns
infixes = (r"'s\b", r"(?<!\d)\.(?!\d)") + nlp.Defaults.prefixes
infix_re = spacy.util.compile_infix_regex(infixes)

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u"That is Yahya's laptop-cover. 3.14!")
print([t.text for t in doc])

Output: ['That', 'is', 'Yahya', "'s", 'laptop-cover', '.', '3.14', '!']

Details

  • r"'s\b"-与's匹配并带有单词边界
  • r"(?<!\d)\.(?!\d)-匹配在.之前或之后没有数字的位置.
  • r"'s\b" - matches 's that are followed with a word boundary
  • r"(?<!\d)\.(?!\d) - matches a . that is not preceded or followed with a digit.

And if you want to use a custom tokenizer that keeps hyphenated letter words as single tokens, you will have to re-define the infixes: the r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS) line accounts for that behavior, and you need to get rid of it. Since it is the only item that contains the -|–|—|--|---|——|~ string, the easiest way is to drop this item from the infixes and re-compile the infix pattern:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

inf = list(nlp.Defaults.infixes)
inf = [x for x in inf if '-|–|—|--|---|——|~' not in x] # remove the hyphen-between-letters pattern from infix patterns
infix_re = compile_infix_regex(tuple(inf))

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                                suffix_search=nlp.tokenizer.suffix_search,
                                infix_finditer=infix_re.finditer,
                                token_match=nlp.tokenizer.token_match,
                                rules=nlp.Defaults.tokenizer_exceptions)

nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("That is Yahya's laptop-cover. 3.14!")
print([t.text for t in doc])
# => ['That', 'is', 'Yahya', "'s", 'laptop-cover', '.', '3.14', '!']
