spaCy rule matcher on unit of measure before or after digit


Question

I am new to spaCy and I am trying to match measurements in some text. My problem is that the unit of measure sometimes comes before the value and sometimes after it. In some other cases the unit has a different name. Here is some code:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

# case 1:
text = "the surface is 31 sq"
# case 2:
# text = "the surface is sq 31"
# case 3:
# text = "the surface is square meters 31"
# case 4:
# text = "the surface is 31 square meters"
# case 5:
# text = "the surface is about 31 square meters"
# case 6:
# text = "the surface is 31 kilograms"

pattern = [
    {"IS_STOP": True}, 
    {"LOWER": "surface"}, 
    {"LEMMA": "be", "OP": "?"}, 
    {"LOWER": "sq", "OP": "?"},
    {"LOWER": "square", "OP": "?"},
    {"LOWER": "meters", "OP": "?"},
    {"IS_DIGIT": True}, 
    {"LOWER": "square", "OP": "?"},
    {"LOWER": "meters", "OP": "?"},
    {"LOWER": "sq", "OP": "?"} 
]

doc = nlp(text)

matcher = Matcher(nlp.vocab) 

# spaCy v3 signature: patterns are passed as a list (v2 used matcher.add("Surface", None, pattern))
matcher.add("Surface", [pattern])

matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

I have two problems. 1 - The pattern should match all of cases 1 to 5, but for case 1 the output is:

4898162435462687487 Surface 0 4 the surface is 31
4898162435462687487 Surface 0 5 the surface is 31 sq 

which looks to me like a duplicate match.

2 - Case 6 should not match, but with my pattern it does. Any suggestions on how to improve this?

Is it possible to build an OR condition within the pattern? Something like:

pattern = [
    {"POS": "DET", "OP": "?"}, 
    {"LOWER": "surface"}, 
    {"LEMMA": "be", "OP": "?"},  
    [
      [{"LOWER": "sq", "OP": "?"},
      {"LOWER": "square", "OP": "?"},
      {"LOWER": "meters", "OP": "?"},
      {"IS_ALPHA": True, "OP": "?"},
      {"LIKE_NUM": True}]
     OR
      [{"LIKE_NUM": True},
      {"LOWER": "square", "OP": "?"},
      {"LOWER": "meters", "OP": "?"},
      {"LOWER": "sq", "OP": "?"} ]
    ]
]

Recommended answer

You cannot use an OR like that, but you can define separate patterns for the same label. So you need two patterns: one matches a number preceded by sq, square, meters, or a combination of these words, and the other matches a number followed by at least one of these words.

Code snippet:

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")

texts = ["the surface is 31 sq", "the surface is sq 31", "the surface is square meters 31",
     "the surface is 31 square meters", "the surface is about 31 square meters", "the surface is 31 kilograms"]
pattern1 = [
      {"IS_STOP": True}, 
      {"LOWER": "surface"}, 
      {"LEMMA": "be", "OP": "?"}, 
      {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"},
      {"LIKE_NUM": True}
    ]
pattern2 = [
      {"IS_STOP": True}, 
      {"LOWER": "surface"}, 
      {"LEMMA": "be", "OP": "?"}, 
      {"IS_ALPHA": True, "OP": "?"},
      {"LIKE_NUM": True},
      {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"}
    ]

matcher = Matcher(nlp.vocab, validate=True)
# spaCy v3 signature: patterns are passed as a list (v2 used matcher.add("Surface", None, pattern1))
matcher.add("Surface", [pattern1, pattern2])

for text in texts:
  doc = nlp(text)
  matches = matcher(doc)
  for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

Output:

4898162435462687487 Surface 0 5 the surface is 31 sq
4898162435462687487 Surface 0 5 the surface is sq 31
4898162435462687487 Surface 0 6 the surface is square meters 31
4898162435462687487 Surface 0 5 the surface is 31 square
4898162435462687487 Surface 0 6 the surface is about 31 square
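On the duplicate match from the question: the Matcher reports every span that satisfies a pattern, so with optional or repeated tokens a shorter and a longer match can share the same start. Below is a minimal sketch, assuming you only want the longest match in each overlapping group. The helper name `keep_longest` is hypothetical (it is not part of spaCy), and it works on plain `(match_id, start, end)` tuples so it can be tried without loading a model.

```python
def keep_longest(matches):
    """Keep the longest matches, dropping any match that overlaps one already kept.

    `matches` is a list of (match_id, start, end) tuples, the shape that
    matcher(doc) returns. This helper is illustrative, not a spaCy API.
    """
    # Longest spans first; ties are broken by the earliest start.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept = []
    taken = set()  # token indices already covered by a kept match
    for match_id, start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((match_id, start, end))
            taken.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept, key=lambda m: m[1])


# The two matches reported for case 1 in the question.
case1 = [(4898162435462687487, 0, 4), (4898162435462687487, 0, 5)]
print(keep_longest(case1))  # [(4898162435462687487, 0, 5)]
```

spaCy also ships `spacy.util.filter_spans` for `Span` objects, and in v3 `matcher.add` accepts `greedy="LONGEST"`; either may be preferable to a hand-written filter if your installed version supports it.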

The {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"} part matches one or more tokens (due to "OP": "+") whose text matches the regex:

  • ^ - start of the token
  • (?i: - start of a case-insensitive modifier group:
    • sq(?:uare)? - sq or square
    • | - or
    • m(?:et(?:er|re)s?)? - m, meter/metre or meters/metres
  • ) - end of the group
  • $ - end of the token
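The regex can be checked on its own with Python's `re` module before putting it into a pattern. This is a small sketch, assuming the matcher applies the regex as a search over each token's text; the sample token list is made up for illustration.

```python
import re

# Same expression as in the patterns; the ^ and $ anchors force a whole-token match.
UNIT = re.compile(r"^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$")

# Illustrative tokens: unit spellings that should pass and words that should not.
samples = ["sq", "Square", "m", "meter", "metres", "kilograms", "squares"]

results = {token: bool(UNIT.search(token)) for token in samples}
for token, ok in results.items():
    print(token, ok)
```

"kilograms" is rejected because it does not start with any of the alternatives, which is why case 6 no longer matches. "squares" is rejected because of the `$` anchor.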
