SpaCy -- intra-word hyphens. How to treat them as one word?
Problem description
What are the first (r"[./]") and the last (r"(.'.)") patterns used for in the following?

infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")

As a result, the following is produced:

$python3 /tmp/nlp.py
['Marketing-Representative-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']
['Out-of-box', 'implementation']

I want spacy to treat intra-word hyphenated words as one token, without impacting negatively on other split rules. For the sentence

"That is Yahya's laptop-cover. 3.14!"

I want the split as follows:

["That", "is", "Yahya", "'s", "laptop-cover", ".", "3.14", "!"] (EXPECTED)

By default, SpaCy gives:
import spacy

nlp = spacy.load('en_core_web_md')
for token in nlp("That is Yahya's laptop-cover. 3.14!"):
    print(token.text)
["That", "is", "Yahya", "'s", "laptop", "-", "cover", ".", "3.14", "!"]
However, the following:

from spacy.util import compile_infix_regex

infixes = nlp.Defaults.prefixes + tuple([r"[-]~"])
infix_re = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer = spacy.tokenizer.Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
for token in nlp("That is Yahya's laptop-cover. 3.14!"):
    print(token.text)

gives:
["That", "is", "Yahya", "'", "s", "laptop-cover.", "3.14", "!"]
Recommended answer
NOTE: To see the custom tokenizer that keeps the hyphenated words, see the bottom of the answer.

Here, a custom tokenizer is defined that tokenizes text into tokens using a set of built-in (nlp.Defaults.prefixes) and custom ([./], [-]~, (.'.)) patterns. As you see, these are all regular expressions, and they are used to process in-word punctuation, the infixes. See the Spacy tokenizer algorithm; it can be summarized as follows: iterate over the whitespace-separated substrings, check whether a tokenizer exception matches the substring, otherwise try to consume a prefix, then a suffix, and once neither can be consumed any more, split what remains on the infix patterns.

Now, when we are at the infix handling step, these regular expressions are used to split the text into tokens based also on these patterns. E.g. the

nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
is tuple concatenation operation, the result looks like('§', '%', '=', '—', '–', '\\+(?![0-9])', '…', '……', ',', ':', ';', '\\!', '\\?', '¿', '؟', '¡', '\\(', '\\)', '\\[', '\\]', '\\{', '\\}', '<', '>', '_', '#', '\\*', '&', '。', '?', '!', ',', '、', ';', ':', '~', '·', '।', '،', '؛', '٪', '\\.\\.+', '…', "\\'", '"', '"', '"', '`', '‘', '´', '’', '‚', ',', '„', '»', '«', '「', '」', '『', '』', '(', ')', '〔', '〕', '【', '】', '《', '》', '〈', '〉', '\\$', '£', '€', '¥', '฿', 'US\\$', 'C\\$', 'A\\$', '₽', '﷼', '₴', '[\\u00A6\\u00A9\\u00AE\\u00B0\\u0482\\u058D\\u058E\\u060E\\u060F\\u06DE\\u06E9\\u06FD\\u06FE\\u07F6\\u09FA\\u0B70\\u0BF3-\\u0BF8\\u0BFA\\u0C7F\\u0D4F\\u0D79\\u0F01-\\u0F03\\u0F13\\u0F15-\\u0F17\\u0F1A-\\u0F1F\\u0F34\\u0F36\\u0F38\\u0FBE-\\u0FC5\\u0FC7-\\u0FCC\\u0FCE\\u0FCF\\u0FD5-\\u0FD8\\u109E\\u109F\\u1390-\\u1399\\u1940\\u19DE-\\u19FF\\u1B61-\\u1B6A\\u1B74-\\u1B7C\\u2100\\u2101\\u2103-\\u2106\\u2108\\u2109\\u2114\\u2116\\u2117\\u211E-\\u2123\\u2125\\u2127\\u2129\\u212E\\u213A\\u213B\\u214A\\u214C\\u214D\\u214F\\u218A\\u218B\\u2195-\\u2199\\u219C-\\u219F\\u21A1\\u21A2\\u21A4\\u21A5\\u21A7-\\u21AD\\u21AF-\\u21CD\\u21D0\\u21D1\\u21D3\\u21D5-\\u21F3\\u2300-\\u2307\\u230C-\\u231F\\u2322-\\u2328\\u232B-\\u237B\\u237D-\\u239A\\u23B4-\\u23DB\\u23E2-\\u2426\\u2440-\\u244A\\u249C-\\u24E9\\u2500-\\u25B6\\u25B8-\\u25C0\\u25C2-\\u25F7\\u2600-\\u266E\\u2670-\\u2767\\u2794-\\u27BF\\u2800-\\u28FF\\u2B00-\\u2B2F\\u2B45\\u2B46\\u2B4D-\\u2B73\\u2B76-\\u2B95\\u2B98-\\u2BC8\\u2BCA-\\u2BFE\\u2CE5-\\u2CEA\\u2E80-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u2FF0-\\u2FFB\\u3004\\u3012\\u3013\\u3020\\u3036\\u3037\\u303E\\u303F\\u3190\\u3191\\u3196-\\u319F\\u31C0-\\u31E3\\u3200-\\u321E\\u322A-\\u3247\\u3250\\u3260-\\u327F\\u328A-\\u32B0\\u32C0-\\u32FE\\u3300-\\u33FF\\u4DC0-\\u4DFF\\uA490-\\uA4C6\\uA828-\\uA82B\\uA836\\uA837\\uA839\\uAA77-\\uAA79\\uFDFD\\uFFE4\\uFFE8\\uFFED\\uFFEE\\uFFFC\\uFFFD\\U00010137-\\U0001013F\\U00010179-\\U00010189\\U0001018C-\\U0001018E\\U00010190-\\U0001019B\\U000101A0\\U000101D0-\\U000101FC\\U00010877\\U00010878\\U00010AC8\\U0001173F\\U00016B3C-\\U00016B3F\\U00016B45\\U0001BC9C\\U0001D000-\\U0001D0F5\\U0001D100-\\U0001D126\\U0001D129-\\U0001D164\\U0001D16A-\\U0001D16C\\U0001D183\\U0001D184\\U0001D18C-\\U0001D1A9\\U0001D1AE-\\U0001D1E8\\U0001D200-\\U0001D241\\U0001D245\\U0001D300-\\U0001D356\\U0001D800-\\U0001D9FF\\U0001DA37-\\U0001DA3A\\U0001DA6D-\\U0001DA74\\U0001DA76-\\U0001DA83\\U0001DA85\\U0001DA86\\U0001ECAC\\U0001F000-\\U0001F02B\\U0001F030-\\U0001F093\\U0001F0A0-\\U0001F0AE\\U0001F0B1-\\U0001F0BF\\U0001F0C1-\\U0001F0CF\\U0001F0D1-\\U0001F0F5\\U0001F110-\\U0001F16B\\U0001F170-\\U0001F1AC\\U0001F1E6-\\U0001F202\\U0001F210-\\U0001F23B\\U0001F240-\\U0001F248\\U0001F250\\U0001F251\\U0001F260-\\U0001F265\\U0001F300-\\U0001F3FA\\U0001F400-\\U0001F6D4\\U0001F6E0-\\U0001F6EC\\U0001F6F0-\\U0001F6F9\\U0001F700-\\U0001F773\\U0001F780-\\U0001F7D8\\U0001F800-\\U0001F80B\\U0001F810-\\U0001F847\\U0001F850-\\U0001F859\\U0001F860-\\U0001F887\\U0001F890-\\U0001F8AD\\U0001F900-\\U0001F90B\\U0001F910-\\U0001F93E\\U0001F940-\\U0001F970\\U0001F973-\\U0001F976\\U0001F97A\\U0001F97C-\\U0001F9A2\\U0001F9B0-\\U0001F9B9\\U0001F9C0-\\U0001F9C2\\U0001F9D0-\\U0001F9FF\\U0001FA60-\\U0001FA6D]', '[/.]', '-~', "(.'.)")
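Before going through the custom patterns one by one: you can watch the tokenizer algorithm described above at work with nlp.tokenizer.explain(), which reports which rule type (special case, prefix, suffix, infix) produced each token. A minimal sketch (the model name follows the question; any English pipeline works):

import spacy

nlp = spacy.load("en_core_web_md")
# each entry pairs a rule label (e.g. SPECIAL-1, PREFIX, SUFFIX, INFIX)
# with the token text it produced
for rule, text in nlp.tokenizer.explain("That is Yahya's laptop-cover. 3.14!"):
    print(rule, repr(text))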
The [/.] pattern is important because if you do not add it, abc.def/ghi will be a single token; with the pattern added, it is split into 'abc', '.', 'def', '/', 'ghi'.

[-]~ (which is the same as -~) matches a - and wants to match a ~ right after it, but since the ~ is not there, the - is skipped and no split occurs: you get the whole 'Marketing-Representative-' token. Note, however, that if you have 'Marketing-~Representative-' in the sentence and use the -~ regex, you will get ['Marketing', '-~', 'Representative-'] as a result, because there will be a match.

The .'. regex matches any char + ' + any char (a dot matches any character in a regex). So the rule just tokenizes (splits out) these fragments from the sentence (e.g. n't, r'd, etc.). All three behaviors are demonstrated in the sketch below.
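Here is a minimal sketch that compiles the defaults plus the three custom patterns and replays the examples above (the model name follows the question):

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_md")
infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
infix_re = compile_infix_regex(infixes)
# an infix-only tokenizer: no prefix/suffix stripping, no exception rules
nlp.tokenizer = Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

print([t.text for t in nlp("abc.def/ghi")])
# [./] splits on dots and slashes: ['abc', '.', 'def', '/', 'ghi']
print([t.text for t in nlp("Marketing-Representative- won't die")])
# [-]~ finds no '-~' here, so the hyphens survive, while (.'.) splits out "n't":
# ['Marketing-Representative-', 'wo', "n't", 'die']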
You should be very careful when adding new rules and check that they do not overlap with already added ones. E.g. when you add r"\b's\b" to split out the genitive case apostrophe-s, you should "override" the "\\'" rule from nlp.Defaults.prefixes. Either remove it if you do not plan to match ' as an infix, or give priority to your custom rules by appending nlp.Defaults.prefixes to those rules, not vice versa (the flipped order is demonstrated after the sample output below).

Check out the sample code:
import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_md")
infixes = tuple([r"'s\b", r"(?<!\d)\.(?!\d)"]) + nlp.Defaults.prefixes
infix_re = spacy.util.compile_infix_regex(infixes)

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u"That is Yahya's laptop-cover. 3.14!")
print([t.text for t in doc])

Output:
['That', 'is', 'Yahya', "'s", 'laptop-cover', '.', '3.14', '!']
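The order of that concatenation is what does the overriding: compile_infix_regex joins all the patterns into one alternation, and of two alternatives that match at the same position, the one listed first wins. Flipping the order (a hedged counter-example, not run here) should reproduce the broken split from the question:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_md")
# defaults first: the built-in "\\'" rule matches the apostrophe before
# our r"'s\b" rule is tried, so the genitive 's is torn apart again
infixes = nlp.Defaults.prefixes + tuple([r"'s\b", r"(?<!\d)\.(?!\d)"])
infix_re = compile_infix_regex(infixes)
nlp.tokenizer = Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
print([t.text for t in nlp("That is Yahya's laptop-cover. 3.14!")])
# expected: "'" and 's' come out as separate tokens, as in the question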
r"'s\b"
-与's
匹配并带有单词边界r"(?<!\d)\.(?!\d)
-匹配在.
之前或之后没有数字的位置.
r"'s\b"
- matches 's
that are followed with a word boundaryr"(?<!\d)\.(?!\d)
- matches a .
that is not preceded or followed with a digit.-|–|—|--|---|——|~
And if you want to use a custom tokenizer that keeps hyphenated letter words as single tokens, you will have to re-define the infixes: the

r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),

line accounts for that, and you need to get rid of it. Since it is the only item that contains the -|–|—|--|---|——|~ string, the easiest way is to drop this item from the infixes and re-compile the infix pattern:
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")
inf = list(nlp.Defaults.infixes)
# remove the hyphen-between-letters pattern from the infix patterns
inf = [x for x in inf if '-|–|—|--|---|——|~' not in x]
infix_re = compile_infix_regex(tuple(inf))

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                     suffix_search=nlp.tokenizer.suffix_search,
                     infix_finditer=infix_re.finditer,
                     token_match=nlp.tokenizer.token_match,
                     rules=nlp.Defaults.tokenizer_exceptions)

nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("That is Yahya's laptop-cover. 3.14!")
print([t.text for t in doc])
# => ['That', 'is', 'Yahya', "'s", 'laptop-cover', '.', '3.14', '!']
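If you prefer not to rebuild the whole Tokenizer, spaCy also lets you swap a single attribute on the existing tokenizer in place. A minimal sketch of the same filtering (this should give the same output as above, assuming a stock English pipeline):

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")
# keep every default infix pattern except the hyphen-between-letters one
inf = [x for x in nlp.Defaults.infixes if '-|–|—|--|---|——|~' not in x]
nlp.tokenizer.infix_finditer = compile_infix_regex(tuple(inf)).finditer

print([t.text for t in nlp("That is Yahya's laptop-cover. 3.14!")])
# expected to match: ['That', 'is', 'Yahya', "'s", 'laptop-cover', '.', '3.14', '!']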