spaCy rule matcher on unit of measure before or after digit
Question
I am new to spaCy and I am trying to match some measurements in some text. My problem is that the unit of measure sometimes comes before the value and sometimes after it. In some other cases it has a different name. Here is some code:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
# case 1:
text = "the surface is 31 sq"
# case 2:
# text = "the surface is sq 31"
# case 3:
# text = "the surface is square meters 31"
# case 4:
# text = "the surface is 31 square meters"
# case 5:
# text = "the surface is about 31 square meters"
# case 6:
# text = "the surface is 31 kilograms"
pattern = [
{"IS_STOP": True},
{"LOWER": "surface"},
{"LEMMA": "be", "OP": "?"},
{"LOWER": "sq", "OP": "?"},
{"LOWER": "square", "OP": "?"},
{"LOWER": "meters", "OP": "?"},
{"IS_DIGIT": True},
{"LOWER": "square", "OP": "?"},
{"LOWER": "meters", "OP": "?"},
{"LOWER": "sq", "OP": "?"}
]
doc = nlp(text)
matcher = Matcher(nlp.vocab)
matcher.add("Surface", None, pattern)
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
I have two problems:
1 - The pattern should be able to match all of cases 1 to 5, but for case 1 the output is
4898162435462687487 Surface 0 4 the surface is 31
4898162435462687487 Surface 0 5 the surface is 31 sq
which looks to me like a duplicate match.
2 - Case 6 should not match, but with my pattern it does. Any suggestions on how to improve this?
Is it possible to build an OR condition within the pattern? Something like
pattern = [
{"POS": "DET", "OP": "?"},
{"LOWER": "surface"},
{"LEMMA": "be", "OP": "?"},
[
[{"LOWER": "sq", "OP": "?"},
{"LOWER": "square", "OP": "?"},
{"LOWER": "meters", "OP": "?"},
{"IS_ALPHA": True, "OP": "?"},
{"LIKE_NUM": True}]
OR
[{"LIKE_NUM": True},
{"LOWER": "square", "OP": "?"},
{"LOWER": "meters", "OP": "?"},
{"LOWER": "sq", "OP": "?"} ]
]
]
Recommended answer
You cannot use an OR like that, but you can define separate patterns for the same label. So you need two patterns: one that matches a number preceded by "sq", "square" or "meters" (or a combination of these words), and another that matches a number followed by at least one of these words.
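Because the two variants differ only in whether the unit comes before or after the number, both can be generated from a shared prefix. The following is a minimal sketch under that assumption; `build_surface_patterns` and `UNIT_REGEX` are hypothetical names introduced for illustration, not part of spaCy, and the registration calls are shown only as comments.

```python
# Hypothetical helper (not a spaCy API): builds the "unit before number" and
# "number before unit" token patterns from one shared prefix.
UNIT_REGEX = r"^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"


def build_surface_patterns(unit_regex=UNIT_REGEX):
    prefix = [
        {"IS_STOP": True},
        {"LOWER": "surface"},
        {"LEMMA": "be", "OP": "?"},
    ]
    unit = {"TEXT": {"REGEX": unit_regex}, "OP": "+"}  # one or more unit tokens
    number = {"LIKE_NUM": True}
    unit_first = prefix + [unit, number]
    # The optional alphabetic token allows a word such as "about" before the number.
    number_first = prefix + [{"IS_ALPHA": True, "OP": "?"}, number, unit]
    return [unit_first, number_first]


patterns = build_surface_patterns()
# spaCy v3: matcher.add("Surface", patterns)
# spaCy v2: matcher.add("Surface", None, *patterns)
```

Keeping the unit regex in one place means a new unit spelling only has to be added once for both orderings.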
Code snippet:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
texts = ["the surface is 31 sq", "the surface is sq 31", "the surface is square meters 31",
"the surface is 31 square meters", "the surface is about 31 square meters", "the surface is 31 kilograms"]
pattern1 = [
{"IS_STOP": True},
{"LOWER": "surface"},
{"LEMMA": "be", "OP": "?"},
{"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"},
{"LIKE_NUM": True}
]
pattern2 = [
{"IS_STOP": True},
{"LOWER": "surface"},
{"LEMMA": "be", "OP": "?"},
{"IS_ALPHA": True, "OP": "?"},
{"LIKE_NUM": True},
{"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"}
]
matcher = Matcher(nlp.vocab, validate=True)
# spaCy v2 signature; in spaCy v3 this is matcher.add("Surface", [pattern1, pattern2])
matcher.add("Surface", None, pattern1)
matcher.add("Surface", None, pattern2)
for text in texts:
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # The matched span
        print(match_id, string_id, start, end, span.text)
Output:
4898162435462687487 Surface 0 5 the surface is 31 sq
4898162435462687487 Surface 0 5 the surface is sq 31
4898162435462687487 Surface 0 6 the surface is square meters 31
4898162435462687487 Surface 0 5 the surface is 31 square
4898162435462687487 Surface 0 6 the surface is about 31 square
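Optional operators such as "?" and "+" can make the matcher report several overlapping spans for one label, which is what the question's case 1 output shows with (0, 4) and (0, 5). If only the longest span is wanted, the overlaps can be filtered afterwards. The sketch below works on plain (match_id, start, end) tuples so it runs without a model; `keep_longest` is a hypothetical helper written for this illustration. spaCy itself offers `spacy.util.filter_spans` for `Span` objects, and in spaCy v3 `Matcher.add` accepts `greedy="LONGEST"`, which may be a simpler route.

```python
def keep_longest(matches):
    """Keep the longest span among overlapping (match_id, start, end) tuples."""
    # Longest first; on equal length, prefer the span that starts earlier.
    ordered = sorted(matches, key=lambda m: (m[2] - m[1], -m[1]), reverse=True)
    kept, seen = [], set()
    for match_id, start, end in ordered:
        # Skip any span that shares a token index with an already kept span.
        if not any(i in seen for i in range(start, end)):
            kept.append((match_id, start, end))
            seen.update(range(start, end))
    return sorted(kept, key=lambda m: m[1])


# The two matches reported for case 1 in the question.
matches = [
    (4898162435462687487, 0, 4),  # "the surface is 31"
    (4898162435462687487, 0, 5),  # "the surface is 31 sq"
]
print(keep_longest(matches))  # [(4898162435462687487, 0, 5)]
```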
The {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"} part matches one or more tokens (due to "OP": "+") that match the regex:
- ^ - start of the token
- (?i: - start of a case-insensitive modifier group:
  - sq(?:uare)? - sq or square
  - | - or
  - m(?:et(?:er|re)s?)? - m, meter/metre, or meters/metres
- ) - end of the group
- $ - end of the token
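The regex can be checked outside spaCy with Python's `re` module. This is a small sketch; the token list is made up for illustration, and since the pattern is anchored with ^ and $, each token has to match as a whole.

```python
import re

UNIT = re.compile(r"^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$")

# Sample tokens chosen for illustration only.
tokens = ["sq", "Square", "m", "meter", "metre", "meters", "metres",
          "kilograms", "squares", "sqm"]

unit_tokens = [t for t in tokens if UNIT.search(t)]
print(unit_tokens)
# ['sq', 'Square', 'm', 'meter', 'metre', 'meters', 'metres']
```

"kilograms" is rejected, which is why case 6 no longer matches, and near misses such as "squares" or "sqm" are rejected as well because of the anchors.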