spaCy rule matcher: extract a value from a matched sentence

Problem description

I have a custom rule-based matcher in spaCy that matches some sentences in a document. I would now like to extract numbers from the matched sentences. However, the matched sentences do not always have the same shape and form. What is the best way to do this?

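import spacy
from spacy.matcher import Matcher

# Assumption: the question does not show how nlp was created; any English
# pipeline with a lemmatizer (needed for the LEMMA attribute) should work.
nlp = spacy.load("en_core_web_sm")

# Example sentences; only the last one (kilograms) should not match
texts = [
    "the surface is 31 sq",
    "the surface is sq 31",
    "the surface is square meters 31",
    "the surface is 31 square meters",
    "the surface is about 31,2 square",
    "the surface is 31 kilograms",
]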

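# Unit word(s) before the number, e.g. "surface is sq 31"
pattern = [
    {"LOWER": "surface"},
    {"LEMMA": "be", "OP": "?"},
    {"TEXT": {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"},
    {"IS_ALPHA": True, "OP": "?"},
    {"LIKE_NUM": True},
]

# Number before the unit word, e.g. "surface is 31 sq"
pattern_1 = [
    {"LOWER": "surface"},
    {"LEMMA": "be", "OP": "?"},
    {"IS_ALPHA": True, "OP": "?"},
    {"LIKE_NUM": True},
    # "OP" is a token-level key and is not valid inside the REGEX dict,
    # so it is dropped here; this entry matches a single unit token.
    {"TEXT": {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}},
]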

matcher = Matcher(nlp.vocab)

# spaCy v3 signature; spaCy v2 used matcher.add("Surface", None, pattern, pattern_1)
matcher.add("Surface", [pattern, pattern_1])

for index, text in enumerate(texts):
    print(f"Case {index}")
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # The matched span
        print(match_id, string_id, start, end, span.text)

My output is:

Case 0
4898162435462687487 Surface 1 5 surface is 31 sq
Case 1
4898162435462687487 Surface 1 5 surface is sq 31
Case 2
4898162435462687487 Surface 1 6 surface is square meters 31
Case 3
4898162435462687487 Surface 1 5 surface is 31 square
Case 4
4898162435462687487 Surface 1 6 surface is about 31,2 square
Case 5
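
Case 5 produces no match because "kilograms" is not accepted by the unit regex used in both patterns. The regex can be checked on individual token strings with Python's standard `re` module. This is only a sketch: it assumes that spaCy's `REGEX` predicate searches the token text the way `re.search` does, and the helper `is_unit` is a hypothetical name introduced for illustration.

```python
import re

# Unit regex copied from the patterns in the question.
UNIT_RE = re.compile(r"^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$")

def is_unit(token_text: str) -> bool:
    """Return True if a single token string is accepted by the unit regex."""
    return UNIT_RE.search(token_text) is not None

# Sample token strings, mostly taken from the example sentences.
tokens = ["sq", "square", "meters", "Metres", "m", "kilograms", "31"]
unit_flags = {tok: is_unit(tok) for tok in tokens}
print(unit_flags)
```

The scoped flag `(?i:...)` makes only the group case-insensitive, and the `^`/`$` anchors force the whole token to match, so a token that merely starts with "m" is rejected.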

I would like to return only the number (of square meters), something like [31, 31, 31, 31, 31.2], rather than the full matched text. What is the correct way to do this in spaCy?

Recommended answer

Since each match contains exactly one token that satisfies LIKE_NUM, you can iterate over the subtree of the matched span and return the first such token:

value = [token for token in span.subtree if token.like_num][0]

Test:

results = []
for text in texts:
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]  # The matched span
        results.append([token for token in span.subtree if token.like_num][0])

print(results)  # => [31, 31, 31, 31, 31,2] (spaCy Token objects, not numbers)
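
The list above holds spaCy `Token` objects, while the question asks for numbers such as 31.2. A minimal sketch of the conversion step follows. It assumes the comma in "31,2" is a decimal separator, as the expected result in the question suggests, and it uses a hypothetical helper `to_number` that works on the token text:

```python
def to_number(text: str) -> float:
    """Convert a numeric token string such as "31" or "31,2" to a float.

    Assumption: a comma is a decimal separator, not a thousands separator.
    """
    return float(text.replace(",", "."))

# Token texts as they appear in the matches above; with spaCy objects this
# would be [to_number(token.text) for token in results].
token_texts = ["31", "31", "31", "31", "31,2"]
numbers = [to_number(t) for t in token_texts]
print(numbers)  # [31.0, 31.0, 31.0, 31.0, 31.2]
```

Note that `like_num` also accepts number words such as "ten", for which `float` raises a `ValueError`, so real input may need an extra check.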
