spaCy EntityRuler doesn't work for cardinal (Social Security number)


Question

I have used the EntityRuler to add a new label for Social Security numbers. I even set overwrite_ents=True, but the number is still not recognized.

I verified that the regex is correct. I am not sure what else I need to do. I also tried before="ner", but the result was the same.

import spacy
from spacy.pipeline import EntityRuler

text = "My name is yuyyvb and I leave on 605 W Clinton Street. My social security 690-96-4032"
nlp = spacy.load("en_core_web_sm")
ruler = EntityRuler(nlp, overwrite_ents=True)
ruler.add_patterns([{"label": "SSN", "pattern": [{"TEXT": {"REGEX": r"\d{3}[^\w]\d{2}[^\w]\d{4}"}}]}])
nlp.add_pipe(ruler)
doc  = nlp(text)
for ent in doc.ents:
    print("{} {}".format(ent.text, ent.label_))

Recommended answer

Actually, spaCy tokenizes the SSN you have into 5 tokens:

print([token.text for token in nlp("690-96-4032")])
# => ['690', '-', '96', '-', '4032']

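So, either use a custom tokenizer in which the - between digits is not split out as a separate token, or, more simply, create a pattern for the 5 consecutive tokens: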

patterns = [{"label": "SSN", "pattern": [{"TEXT": {"REGEX": r"^\d{3}$"}}, {"TEXT": "-"}, {"TEXT": {"REGEX": r"^\d{2}$"}}, {"TEXT": "-"}, {"TEXT": {"REGEX": r"^\d{4}$"}} ]}]

Full spaCy demo:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")
ruler = EntityRuler(nlp, overwrite_ents=True)
patterns = [{"label": "SSN", "pattern": [{"TEXT": {"REGEX": r"^\d{3}$"}}, {"TEXT": "-"}, {"TEXT": {"REGEX": r"^\d{2}$"}}, {"TEXT": "-"}, {"TEXT": {"REGEX": r"^\d{4}$"}} ]}]
ruler.add_patterns(patterns)
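# spaCy v2 API. In spaCy v3, create the component with
# nlp.add_pipe("entity_ruler", config={"overwrite_ents": True}) instead.
nlp.add_pipe(ruler)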

text = "My name is yuyyvb and I leave on 605 W Clinton Street. My social security 690-96-4032"
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])
# => [('605', 'CARDINAL'), ('690-96-4032', 'SSN')]

So, {"TEXT": {"REGEX": r"^\d{3}$"}} matches a token that consists of exactly three digits, {"TEXT": "-"} matches a - character, and so on.
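To see why the single-token pattern from the question never fires while the five-token pattern does, the matching can be imitated with the standard re module alone. This is only a rough sketch: it assumes the REGEX operator is applied to each token's text separately with search semantics, and the helper matches_sequence is a hypothetical stand-in for the ruler, not part of spaCy.

```python
import re

# Token texts as spaCy's default tokenizer produces them for the SSN.
tokens = ["690", "-", "96", "-", "4032"]

# The question's pattern expects the whole SSN inside ONE token.
single = re.compile(r"\d{3}[^\w]\d{2}[^\w]\d{4}")
single_hits = [t for t in tokens if single.search(t)]

# The answer's pattern: one predicate per token, applied position by position.
predicates = [
    re.compile(r"^\d{3}$").search,
    lambda t: t == "-",
    re.compile(r"^\d{2}$").search,
    lambda t: t == "-",
    re.compile(r"^\d{4}$").search,
]

def matches_sequence(toks, preds):
    """Hypothetical helper: return start indices where every predicate holds."""
    n = len(preds)
    return [
        i for i in range(len(toks) - n + 1)
        if all(p(t) for p, t in zip(preds, toks[i:i + n]))
    ]

sequence_hits = matches_sequence(tokens, predicates)
print(single_hits)    # no single token contains the full SSN
print(sequence_hits)  # the five-token window starting at index 0 matches
```

No individual token can satisfy the one-token regex, so the ruler has nothing to label; spreading the pattern over five token slots lines it up with what the tokenizer actually emits.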

Overriding hyphenated number tokenization with spaCy

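If you are interested in how this can be achieved by overriding the default tokenization, pay attention to the infixes: the r"(?<=[0-9])[+\-\*^](?=[0-9-])" regex makes spaCy split hyphen-separated numbers into separate tokens. To have substrings like 1-2-3 and 1-2 tokenized as single tokens, you would want to remove the - from that regex. Simply deleting it is not enough, though, and the fix is trickier: you need to replace the rule with two regexps, r"(?<=[0-9])[+*^](?=[0-9-])" and r"(?<=[0-9])-(?=-)", because the original rule also splits a - that sits between a digit (see (?<=[0-9])) and a hyphen (see (?=[0-9-])), and that behavior has to be preserved.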
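The effect of swapping the infix rules can be checked in isolation with the standard re module before touching the tokenizer. The sketch below is illustrative only: split_points is a hypothetical helper that lists the offsets where an infix rule would fire, and it does not reproduce the rest of spaCy's tokenization (prefixes, suffixes, exceptions).

```python
import re

# Default infix rule quoted above: split on +, -, * or ^ between a digit
# and a following digit or hyphen.
default_infix = re.compile(r"(?<=[0-9])[+\-\*^](?=[0-9-])")

# Replacement rules: the same rule without "-", plus a rule that still
# splits a hyphen sitting between a digit and another hyphen.
custom_infixes = [
    re.compile(r"(?<=[0-9])[+*^](?=[0-9-])"),
    re.compile(r"(?<=[0-9])-(?=-)"),
]

def split_points(text, patterns):
    """Hypothetical helper: sorted start offsets where any pattern matches."""
    return sorted(m.start() for p in patterns for m in p.finditer(text))

print(split_points("690-96-4032", [default_infix]))  # both hyphens are split points
print(split_points("690-96-4032", custom_infixes))   # nothing to split, SSN stays whole
print(split_points("9---al", custom_infixes))        # first hyphen is still a split point
```

With the two replacement rules, hyphens between digits no longer trigger a split, while a hyphen between a digit and another hyphen still does.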

So, the whole thing will look like

import spacy
from spacy.tokenizer import Tokenizer
from spacy.pipeline import EntityRuler
from spacy.util import compile_infix_regex

def custom_tokenizer(nlp):
    # Take out the existing rule and replace it with a custom one:
    inf = list(nlp.Defaults.infixes)
    inf.remove(r"(?<=[0-9])[+\-\*^](?=[0-9-])")
    inf = tuple(inf)
    infixes = inf + tuple([r"(?<=[0-9])[+*^](?=[0-9-])", r"(?<=[0-9])-(?=-)"]) 
    infix_re = compile_infix_regex(infixes)

    return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                                suffix_search=nlp.tokenizer.suffix_search,
                                infix_finditer=infix_re.finditer,
                                token_match=nlp.tokenizer.token_match,
                                rules=nlp.Defaults.tokenizer_exceptions)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
ruler = EntityRuler(nlp, overwrite_ents=True)
ruler.add_patterns([{"label": "SSN", "pattern": [{"TEXT": {"REGEX": r"^\d{3}\W\d{2}\W\d{4}$"}}]}])
nlp.add_pipe(ruler)

text = "My name is yuyyvb and I leave on 605 W Clinton Street. My social security 690-96-4032. Some 9---al"
doc = nlp(text)
print([t.text for t in doc])
# =>  ['My', 'name', 'is', 'yuyyvb', 'and', 'I', 'leave', 'on', '605', 'W', 'Clinton', 'Street', '.', 'My', 'social', 'security', '690-96-4032', '.', 'Some', '9', '-', '--al']
print([(ent.text, ent.label_) for ent in doc.ents])
# => [('605', 'CARDINAL'), ('690-96-4032', 'SSN'), ('9', 'CARDINAL')]

If you omit r"(?<=[0-9])-(?=-)", the ['9', '-', '--al'] tokens will turn into a single '9---al' token.

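Note that you need to use the ^\d{3}\W\d{2}\W\d{4}$ regex: ^ and $ match the start and end of the token (otherwise, partially matching tokens would also be identified as SSNs), and [^\w] is equal to \W.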
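A quick way to convince yourself that the anchors matter is to run both variants of the regex over a few token texts with plain re. The candidate strings below are made up for illustration, and the sketch assumes the pattern is applied to a token's text with search semantics.

```python
import re

unanchored = re.compile(r"\d{3}\W\d{2}\W\d{4}")
anchored = re.compile(r"^\d{3}\W\d{2}\W\d{4}$")

# Made-up token texts; only the first one is exactly SSN-shaped.
candidates = ["690-96-4032", "1690-96-40325", "ref690-96-4032x"]

unanchored_hits = [t for t in candidates if unanchored.search(t)]
anchored_hits = [t for t in candidates if anchored.search(t)]

print(unanchored_hits)  # every candidate contains an SSN-shaped substring
print(anchored_hits)    # only the token that is SSN-shaped from start to end

# [^\w] and \W accept the same separator characters.
same_separator = bool(re.fullmatch(r"\d{3}[^\w]\d{2}[^\w]\d{4}", "690-96-4032"))
print(same_separator)
```

Without ^ and $, any longer token that merely contains an SSN-like run of characters would be labeled SSN as a whole.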
