How to add custom rules to the spaCy tokenizer to break down HTML into single tokens?
I know there are a lot of resources out there for this problem, but I could not get spaCy to do exactly what I want.
I would like to add rules to my spaCy tokenizer so that HTML tags in my text (such as <br/>, etc.) become single tokens.
I am right now using the "merge_noun_chunks" pipe, so I get tokens like this one (a single token):
"documentation<br/>The Observatory Safety System"
I would like to add a rule so that this would get split into 3 tokens:
"documentation", "<br/>", "The Observatory Safety System"
I've looked up a lot of resources: here, and also here. But I couldn't get it to work in my case.
I have tried this:
import re
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    infix_re = re.compile(r'''<[\w+]\/>''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)
I am not sure I understand exactly what changing the infix does. Should I also remove < from the prefixes, as suggested here?
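One subtlety worth pointing out (my observation, not stated in the question): in the attempted pattern, `[\w+]` is a character class that matches exactly one word character (or a literal `+`), so the infix regex can only ever match single-letter empty tags like `<i/>`, never `<br/>`. A quick stdlib check illustrates the difference; `<\w+\/>` is presumably what was intended:

```python
import re

text = "documentation<br/>The Observatory <i/> Safety System"

# [\w+] is a character class: exactly ONE word char (or a literal '+')
broken = re.compile(r'<[\w+]\/>')
# \w+ matches one or more word chars, so multi-letter tags match too
fixed = re.compile(r'<\w+\/>')

print(broken.findall(text))  # ['<i/>'] -- only the single-letter tag
print(fixed.findall(text))   # ['<br/>', '<i/>']
```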
One way to achieve this seems to involve making the tokenizer both
- break up tokens containing a tag without whitespace, and
- "lump" tag-like sequences as single tokens.
To split up tokens like the one in your example, you can modify the tokenizer infixes (in the manner described here):
infixes = nlp.Defaults.infixes + [r'([><])']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
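To get an intuition for what splitting on an infix pattern like `([><])` does to a whitespace-free run of text, a plain `re.split` with the same capturing group gives the idea (this is only a stdlib illustration of the splitting behavior, not how spaCy applies infixes internally):

```python
import re

chunk = "documentation<br/>The"

# the capturing group keeps the delimiters, analogous to spaCy
# keeping each infix match as its own token
parts = [p for p in re.split(r'([><])', chunk) if p]
print(parts)  # ['documentation', '<', 'br/', '>', 'The']
```

Note that the infix split alone yields `<`, `br/`, `>` as three pieces; the special cases below are what glue `<br/>` back together into one token.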
To ensure tags are regarded as single tokens, you can use "special cases" (see the tokenizer overview or the method docs). You would add special cases for opening, closing, and empty tags, e.g.:
# open and close
for tagName in "html body i br p".split():
    nlp.tokenizer.add_special_case(f"<{tagName}>", [{ORTH: f"<{tagName}>"}])
    nlp.tokenizer.add_special_case(f"</{tagName}>", [{ORTH: f"</{tagName}>"}])
# empty
for tagName in "br p".split():
    nlp.tokenizer.add_special_case(f"<{tagName}/>", [{ORTH: f"<{tagName}/>"}])
Taken together:
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_trf")

infixes = nlp.Defaults.infixes + [r'([><])']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

# open and close
for tagName in "html body i br p".split():
    nlp.tokenizer.add_special_case(f"<{tagName}>", [{ORTH: f"<{tagName}>"}])
    nlp.tokenizer.add_special_case(f"</{tagName}>", [{ORTH: f"</{tagName}>"}])
# empty
for tagName in "br p".split():
    nlp.tokenizer.add_special_case(f"<{tagName}/>", [{ORTH: f"<{tagName}/>"}])
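The special cases above cover a hardcoded tag list. If you would rather derive the list from the text itself, a simple stdlib sketch (my suggestion, not part of the original answer) can collect the tags first with a pattern like `</?\w+/?>`, which covers opening, closing, and empty tags but deliberately ignores tags with attributes:

```python
import re

text = "<body>documentation<br/>The Observatory <p> Safety </p> System</body>"

# matches <tag>, </tag> and <tag/>; attributes like <p class="x"> are NOT handled
tags = sorted(set(re.findall(r'</?\w+/?>', text)))
print(tags)  # ['</body>', '</p>', '<body>', '<br/>', '<p>']
```

Each collected string could then be registered via `add_special_case` exactly as in the loops above.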
This seems to yield the expected result. E.g., applying ...
text = """<body>documentation<br/>The Observatory <p> Safety </p> System</body>"""
print("Tokenized:")
for t in nlp(text):
print(t)
... will print the tag in its entirety and on its own:
# ... snip
documentation
<br/>
The
# ... snip
I found the tokenizer's explain method quite helpful in this context. It gives you a breakdown of what was tokenized, and why.