确定正则表达式的特异性 [英] Determine regular expression's specificity

查看:39
本文介绍了确定正则表达式的特异性的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给定以下正则表达式:

 - alice@[a-z]+\.[a-z]+
 - [a-z]+@[a-z]+\.[a-z]+
 - .*

字符串 alice@myprovider.com 显然会匹配所有三个正则表达式.在我正在开发的应用程序中,我们只对最具体"的匹配感兴趣.在这种情况下,这显然是第一个.
不幸的是,似乎没有办法做到这一点.我们正在使用 PCRE,我没有找到方法来做到这一点,而且在互联网上搜索也没有结果.
一种可能的方法是保持正则表达式按降序排序,然后简单地取第一个匹配项.当然,接下来的问题将是如何对正则表达式数组进行排序.将确保数组排序的责任交给最终用户不是一种选择.所以我希望你们能在这里帮助我...

The string alice@myprovider.com will obviously match all three regular expressions. In the application I am developing, we are only interested in the 'most specific' match. In this case this is obviously the first one.
Unfortunately there seems no way to do this. We are using PCRE and I did not find a way to do this and a search on the Internet was also not fruitful.
A possible way would be to keep the regular expressions sorted on descending specificity and then simply take the first match. Of course then the next question would be how to sort the array of regular expressions. It is not an option to give the responsability to the end-user to ensure that the array is sorted. So I hope you guys could help me out here...

谢谢!!

保罗

推荐答案

以下是我根据 Donald Miner 的研究论文开发的针对 MAC 地址应用规则的解决方案,用 Python 实现.

The following is the solution to this problem I developed based on Donald Miner's research paper, implemented in Python, for rules applied to MAC addresses.

基本上,最具体的匹配来自不是任何其他匹配模式的超集的模式.对于特定的问题域,您可以创建一系列测试(函数)来比较两个 RE 并返回超集,或者它们是否是正交的.这使您可以构建匹配树.对于特定的输入字符串,您可以浏览根模式并找到任何匹配项.然后通过他们的子模式.如果在任何时候,正交模式匹配,就会引发错误.

Basically, the most specific match is from the pattern that is not a superset of any other matching pattern. For a particular problem domain, you create a series of tests (functions) which compare two REs and return which is the superset, or if they are orthogonal. This lets you build a tree of matches. For a particular input string, you go through the root patterns and find any matches. Then go through their subpatterns. If at any point, orthogonal patterns match, an error is raised.

设置

import re

class RegexElement:
    def __init__(self, string,index):
        self.string=string
        self.supersets = []
        self.subsets = []
        self.disjoints = []
        self.intersects = []
        self.maybes = []
        self.precompilation = {}
        self.compiled = re.compile(string,re.IGNORECASE)
        self.index = index

SUPERSET  = object()
SUBSET    = object()
INTERSECT = object()
DISJOINT  = object()
EQUAL     = object()

测试

每个测试采用 2 个字符串(a 和 b)并尝试确定它们之间的关系.如果测试无法确定关系,则返回 None.

Each test takes 2 strings (a and b) and tries to determine how they are related. If the test cannot determine the relation, None is returned.

SUPERSET 表示 ab 的超集.b 的所有匹配都将匹配 a.

SUPERSET means a is a superset of b. All matches of b will match a.

SUBSET 表示 ba 的超集.

SUBSET means b is a superset of a.

INTERSECT 表示 a 的一些匹配将匹配 b,但有些不会,而 b 的一些匹配> 不会匹配 a.

INTERSECT means some matches of a will match b, but some won't and some matches of b won't match a.

DISJOINT 表示没有匹配的 a 将匹配 b.

DISJOINT means no matches of a will match b.

EQUAL 表示 a 的所有匹配将匹配 b 并且 b 的所有匹配将匹配 一个.

EQUAL means all matches of a will match b and all matches of b will match a.

    def equal_test(a, b):  
        if a == b: return EQUAL

图表

  class SubsetGraph(object):
    def __init__(self, tests):
        self.regexps = []
        self.tests = tests
        self._dirty = True
        self._roots = None

    @property
    def roots(self):
        if self._dirty:
            r = self._roots = [i for i in self.regexps if not i.supersets]
            return r
        return self._roots

    def add_regex(self, new_regex):
        roots = self.roots
        new_re = RegexElement(new_regex)
        for element in roots:
            self.process(new_re, element)
        self.regexps.append(new_re)

    def process(self, new_re, element):
        relationship = self.compare(new_re, element)
        if relationship:
            getattr(self, 'add_' + relationship)(new_re, element)

    def add_SUPERSET(self, new_re, element):
        for i in element.subsets:
            i.supersets.add(new_re)
            new_re.subsets.add(i)

        element.supersets.add(new_re)
        new_re.subsets.add(element)

    def add_SUBSET(self, new_re, element):
        for i in element.subsets:
            self.process(new_re, i)

        element.subsets.add(new_re)
        new_re.supersets.add(element)

    def add_DISJOINT(self, new_re, element):
        for i in element.subsets:
            i.disjoints.add(new_re)
            new_re.disjoints.add(i)

        new_re.disjoints.add(element)
        element.disjoints.add(new_re)

    def add_INTERSECT(self, new_re, element):
        for i in element.subsets:
            self.process(new_re, i)

        new_re.intersects.add(element)
        element.intersects.add(new_re)

    def add_EQUAL(self, new_re, element):
        new_re.supersets = element.supersets.copy()
        new_re.subsets = element.subsets.copy()
        new_re.disjoints = element.disjoints.copy()
        new_re.intersects = element.intersects.copy()

    def compare(self, a, b):
        for test in self.tests:
            result = test(a.string, b.string)
            if result:
                return result

    def match(self, text, strict=True):
        matches = set()
        self._match(text, self.roots, matches)
        out = []
        for e in matches:
            for s in e.subsets:
                if s in matches:
                    break
            else:
                out.append(e)
        if strict and len(out) > 1:
            for i in out:
                print(i.string)
            raise Exception("Multiple equally specific matches found for " + text)
        return out

    def _match(self, text, elements, matches):
        new_elements = []
        for element in elements:
            m = element.compiled.match(text)
            if m:
                matches.add(element)
                new_elements.extend(element.subsets)
        if new_elements:
            self._match(text, new_elements, matches)

使用

graph = SubsetGraph([equal_test, test_2, test_3, ...])
graph.add_regex("00:11:22:..:..:..")
graph.add_regex("..(:..){5,5}"
graph.match("00:de:ad:be:ef:00")

完整的可用版本在这里.

这篇关于确定正则表达式的特异性的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆