How to add custom rules to spaCy tokenizer to break down HTML in single tokens?


Problem Description


I know there are a lot of resources out there for this problem, but I could not get spaCy to do exactly what I want.

I would like to add rules to my spaCy tokenizer so that HTML tags (such as <br/>, etc.) in my text become single tokens.

I am right now using the "merge_noun_chunks" pipe, so I get tokens like this one:
"documentation<br/>The Observatory Safety System" (this is a single token)

I would like to add a rule so that this would get split into 3 tokens:
"documentation", "<br/>", "The Observatory Safety System"
I've looked up a lot of resources (here and also here), but I couldn't get that to work in my case.

I have tried this:

    
import re
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex

def custom_tokenizer(nlp):  # wrapped in a function so the `return` below is valid
    infix_re = re.compile(r'''<[\w+]\/>''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

I am not sure I understand exactly what changing the infix does. Should I also remove < from prefixes as suggested here?
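For example, I think that suggestion would look roughly like this (untested sketch; the exact "<" entry in the defaults may differ between spaCy versions, and the model name is just an example):

import spacy
from spacy.util import compile_prefix_regex

nlp = spacy.load("en_core_web_sm")
# Drop the "<" entry from the default prefix patterns and recompile.
prefixes = [p for p in nlp.Defaults.prefixes if p != "<"]
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search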

Solution

One way to achieve this seems to involve making the tokenizer both

  1. break up tokens containing a tag without whitespace, and
  2. "lump" tag-like sequences as single tokens.

To split up tokens like the one in your example, you can modify the tokenizer infixes (in the manner described here):

infixes = nlp.Defaults.infixes + [r'([><])']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

To ensure tags are regarded as single tokens, you can use "special cases" (see the tokenizer overview or the method docs). You would add special cases for opened, closed and empty tags, e.g.:

# open and close
for tagName in "html body i br p".split():
    nlp.tokenizer.add_special_case(f"<{tagName}>", [{ORTH: f"<{tagName}>"}])    
    nlp.tokenizer.add_special_case(f"</{tagName}>", [{ORTH: f"</{tagName}>"}])    

# empty
for tagName in "br p".split():
    nlp.tokenizer.add_special_case(f"<{tagName}/>", [{ORTH: f"<{tagName}/>"}])    

Taken together:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_trf")
infixes = nlp.Defaults.infixes + [r'([><])']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

for tagName in "html body i br p".split():
    nlp.tokenizer.add_special_case(f"<{tagName}>", [{ORTH: f"<{tagName}>"}])    
    nlp.tokenizer.add_special_case(f"</{tagName}>", [{ORTH: f"</{tagName}>"}])    

for tagName in "br p".split():
    nlp.tokenizer.add_special_case(f"<{tagName}/>", [{ORTH: f"<{tagName}/>"}])    

This seems to yield the expected result. E.g., applying ...

text = """<body>documentation<br/>The Observatory <p> Safety </p> System</body>"""
print("Tokenized:")
for t in nlp(text):
    print(t)

... will print the tag in its entirety and on its own:

# ... snip
documentation
<br/>
The
# ... snip

I found the tokenizer's explain method quite helpful in this context. It gives you a breakdown of what was tokenized and why.
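For instance, a quick way to get that breakdown (reusing the nlp pipeline and text variable from the example above):

# Each entry pairs the tokenizer rule that fired (PREFIX, SUFFIX, INFIX,
# SPECIAL-n, TOKEN, ...) with the substring it produced.
for rule, token_text in nlp.tokenizer.explain(text):
    print(rule, repr(token_text))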
