How to give one regex pattern priority over another

Question

I am using regular expressions to extract university names. Two main patterns occur:

  1. "Some name" University --> e.g. Anna University
  2. University of "something" --> e.g. University of Exeter

For this, I have written two patterns and joined them into one alternation:

regex = re.compile('|'.join([r'[Uu]niversity of (\w+){1,3}',r'(?:\S+\s){1,3}\S*[uU]niversity']))

But in a few cases I do not get the expected answer. For example,

sentence  = "Biology Department University of Vienna"

For this sentence, applying the regular expression above, I get

"Biology Department University"

which is wrong. I think that, since both patterns can match this sentence, the second pattern is the one that matches and its phrase is extracted.

I need to give priority to the first pattern, so that "University of something" is extracted in scenarios like this one.

Can anyone help?

Recommended answer

In general, the alternatives in a regular expression are tried from left to right at each starting position, so the leftmost alternative has priority there. You already did that, though. The reason you still get the match from the right-hand side of the alternation is that this match is possible earlier in the string: the engine scans from the start of the string and returns the first position at which any alternative succeeds.
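The effect can be checked directly. The following is a small sketch that reuses the pattern and sentence from the question; the names `pattern`, `match` and `first_only` are only illustrative.

```python
import re

# Same two alternatives as in the question, "University of X" listed first.
pattern = re.compile('|'.join([r'[Uu]niversity of (\w+){1,3}',
                               r'(?:\S+\s){1,3}\S*[uU]niversity']))

sentence = "Biology Department University of Vienna"

# The second alternative already succeeds at index 0, so the scan stops
# there and never reaches the later position where the first one fits.
match = pattern.search(sentence)
print(match.start(), repr(match.group()))

# Tried on its own, the first alternative does match, but only further right.
first_only = re.search(r'[Uu]niversity of (\w+){1,3}', sentence)
print(first_only.start(), repr(first_only.group()))
```

Comparing the two start positions shows that the order of the alternatives only matters when both can match at the same starting index.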

Therefore you need to be more specific and allow a "Foo University" match only if no "of" follows. You can use a negative lookahead assertion for this:

import re
# The second alternative is rejected when "university" is followed by "of".
regex = re.compile('|'.join([r'university of (\w+){1,3}',
                             r'(?:\S+\s){1,3}\S*university(?!\s+of\b)']),
                   flags=re.I)
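As a quick check of the lookahead version, here is a minimal sketch that runs it on the sentence from the question and on one made-up sentence of the "Some name University" form. The `samples` list and the second sentence are assumptions added for illustration, not data from the question.

```python
import re

regex = re.compile('|'.join([r'university of (\w+){1,3}',
                             r'(?:\S+\s){1,3}\S*university(?!\s+of\b)']),
                   flags=re.I)

samples = [
    "Biology Department University of Vienna",  # sentence from the question
    "Anna University is listed here",           # made-up example sentence
]

# With the lookahead, "Biology Department University" is rejected because
# "of" follows, so the scan moves on until the first alternative matches.
results = [regex.search(s).group() for s in samples]
for sentence, found in zip(samples, results):
    print(repr(sentence), '->', repr(found))
```

One caveat when adapting this: `(\w+){1,3}` has no whitespace between repetitions, so it matches only a single word after "of", and multi-word names would need a further adjustment to that group.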
