How to write grammar for an expression when it can have many possible forms

Question

I have some sentences that I need to convert to regex code and I was trying to use Pyparsing for it. The sentences are basically search rules, telling us what to search for.

Example sentences -

  1. LINE_CONTAINS this is a phrase - this is an example search rule saying that the line you are searching should contain the phrase this is a phrase

  2. LINE_STARTSWITH However we - this is an example search rule saying that the line you are searching should start with the phrase However we

The rules can be combined too, for example - LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH However we

A list of all actual sentences (if necessary) can be found here.
All lines start with either of the 2 symbols mentioned above (call them line_directives). Now, I am trying to parse these sentences and then convert them to regex code. I started writing a BNF for my grammar and this is what I came up with -

lpar ::= '{'
rpar ::= '}'
line_directive ::= LINE_CONTAINS | LINE_STARTSWITH
phrase ::= lpar(?) + (word+) + rpar(?) # meaning that a phrase wrapped in braces is still the same phrase

upto_N_words ::= lpar + 'UPTO' + num + 'WORDS' + rpar
N_words ::= lpar + num + 'WORDS' + rpar
upto_N_characters ::= lpar + 'UPTO' + num + 'CHARACTERS' + rpar
N_characters ::= lpar + num + 'CHARACTERS' + rpar

JOIN_phrase ::= phrase + JOIN + phrase
AND_phrase ::= phrase (+ AND + phrase)+
OR_phrase ::= phrase (+ OR + phrase)+
BEFORE_phrase ::= phrase (+ BEFORE + phrase)+
AFTER_phrase ::= phrase (+ AFTER + phrase)+

braced_OR_phrase ::= lpar + OR_phrase + rpar
braced_AND_phrase ::= lpar + AND_phrase + rpar
braced_BEFORE_phrase ::= lpar + BEFORE_phrase + rpar
braced_AFTER_phrase ::= lpar + AFTER_phrase + rpar
braced_JOIN_phrase ::= lpar + JOIN_phrase + rpar

rule ::= line_directive + subrule
final_expr ::= rule (+ AND/OR + rule)+

The problem is the subrule, for which (based on the empirical data I have) I have been able to come up with all of the following expressions -

subrule ::= phrase
        ::= OR_phrase
        ::= JOIN_phrase
        ::= BEFORE_phrase
        ::= AFTER_phrase
        ::= AND_phrase
        ::= phrase + upto_N_words + phrase
        ::= braced_OR_phrase + phrase
        ::= phrase + braced_OR_phrase
        ::= phrase + braced_OR_phrase + phrase
        ::= phrase + upto_N_words + braced_OR_phrase
        ::= phrase + upto_N_characters + phrase
        ::= braced_OR_phrase + phrase + upto_N_words + phrase
        ::= phrase + braced_OR_phrase + upto_N_words + phrase

To give an example, one sentence I have is LINE_CONTAINS the objective of this study was {to identify OR identifying} genes upregulated. For this the subrule as mentioned above is phrase + braced_OR_phrase + phrase.
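
For illustration, here is a hand-written regex that this sentence might translate to. It is only a sketch of one plausible target, assuming that consecutive phrases are separated by whitespace and that a braced OR becomes an alternation group. The name `pattern` and the sample lines are made up for this example.

```python
import re

# Hand-written target for:
#   LINE_CONTAINS the objective of this study was {to identify OR identifying} genes upregulated
# Assumptions: adjacent phrases are separated by whitespace, and a braced OR
# becomes a regex alternation.
pattern = re.compile(
    r"the objective of this study was\s+"
    r"(?:to identify|identifying)\s+"
    r"genes upregulated"
)

samples = [
    "In brief, the objective of this study was to identify genes upregulated in mice.",
    "the objective of this study was identifying genes upregulated",
    "the objective of this study was to find genes upregulated",
]

# One boolean per sample line: does the rule match anywhere in the line?
matches = [bool(pattern.search(s)) for s in samples]
print(matches)
```

Writing the expected regex by hand for a few sentences like this gives concrete targets to compare the generated output against.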

So my question is how do I write a simple BNF grammar expression for the subrule so that I would be able to easily code the grammar for it using Pyparsing? Also, any input regarding my present technique is absolutely welcome.

Edit: After applying the principles elucidated by @Paul in his answer, here is the MCVE version of the code. It takes a list of sentences to be parsed, hrrsents, parses each sentence, converts it to its corresponding regex, and returns a list of regex strings -

from pyparsing import *
import re


def parse_hrr(hrrsents):
    UPTO, AND, OR, WORDS, CHARACTERS = map(Literal, "UPTO AND OR WORDS CHARACTERS".split())
    LBRACE,RBRACE = map(Suppress, "{}")
    integer = pyparsing_common.integer()

    LINE_CONTAINS, PARA_STARTSWITH, LINE_ENDSWITH = map(Literal,
        """LINE_CONTAINS PARA_STARTSWITH LINE_ENDSWITH""".split()) # put option for LINE_ENDSWITH. Users may use, I don't presently
    BEFORE, AFTER, JOIN = map(Literal, "BEFORE AFTER JOIN".split())
    keyword = UPTO | WORDS | AND | OR | BEFORE | AFTER | JOIN | LINE_CONTAINS | PARA_STARTSWITH

    class Node(object):
        def __init__(self, tokens):
            self.tokens = tokens

        def generate(self):
            pass

    class LiteralNode(Node):
        def generate(self):
            return "(%s)" %(re.escape(''.join(self.tokens[0]))) # here, merged the elements, so that re.escape does not have to do an escape for the entire list

    class ConsecutivePhrases(Node):
        def generate(self):
            join_these=[]
            tokens = self.tokens[0]
            for t in tokens:
                tg = t.generate()
                join_these.append(tg)
            seq = []
            for word in join_these[:-1]:
                if (r"(([\w]+\s*)" in word) or (r"((\w){0," in word):  # an UPTO expression: no separator is added after it
                    seq.append(word)
                else:
                    seq.append(word + r"\s+")  # raw string, so \s is not treated as a string escape
            seq.append(join_these[-1])
            result = "".join(seq)
            return result

    class AndNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            join_these=[]
            for t in tokens[::2]:
                tg = t.generate()
                tg_mod = tg[0]+r'?=.*\b'+tg[1:][:-1]+r'\b)' # to place the regex commands at the right place
                join_these.append(tg_mod)
            joined = ''.join(ele for ele in join_these)
            full = '('+ joined+')'
            return full

    class OrNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            joined = '|'.join(t.generate() for t in tokens[::2])
            full = '('+ joined+')'
            return full

    class LineTermNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            ret = ''
            dir_phr_map = {
                'LINE_CONTAINS': lambda a:  r"((?:(?<=^)|(?<=[\W_]))" + a + r"(?=[\W_]|$))456", 
                'PARA_STARTSWITH':
                    lambda a: ( r"(^" + a + r"(?=[\W_]|$))457") if 'gene' in repr(a)
                    else (r"(^" + a + r"(?=[\W_]|$))458")}

            for line_dir, phr_term in zip(tokens[0::2], tokens[1::2]):
                ret = dir_phr_map[line_dir](phr_term.generate())
            return ret

    class LineAndNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            return '&&&'.join(t.generate() for t in tokens[::2])

    class LineOrNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            return '@@@'.join(t.generate() for t in tokens[::2])

    class UpToWordsNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            ret = ''
            word_re = r"([\w]+\s*)"
            for op, operand in zip(tokens[1::2], tokens[2::2]):
                # op contains the parsed "upto" expression
                ret += "(%s{0,%d})" % (word_re, op)
            return ret

    class UpToCharactersNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            ret = ''
            char_re = r"\w"
            for op, operand in zip(tokens[1::2], tokens[2::2]):
                # op contains the parsed "upto" expression
                ret += "((%s){0,%d})" % (char_re, op)
            return ret

    class BeforeAfterJoinNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            operator_opn_map = {'BEFORE': lambda a,b: a + '.*?' + b, 'AFTER': lambda a,b: b + '.*?' + a, 'JOIN': lambda a,b: a + '[- ]?' + b}
            ret = tokens[0].generate()
            for operator, operand in zip(tokens[1::2], tokens[2::2]):
                ret = operator_opn_map[operator](ret, operand.generate()) # this is basically calling a dict element, and every such element requires 2 variables (a&b), so providing them as ret and op.generate
            return ret

## THE GRAMMAR
    word = ~keyword + Word(alphas, alphanums+'-_+/()')
    uptowords_expr = Group(LBRACE + UPTO + integer("numberofwords") + WORDS + RBRACE).setParseAction(UpToWordsNode)
    uptochars_expr = Group(LBRACE + UPTO + integer("numberofchars") + CHARACTERS + RBRACE).setParseAction(UpToCharactersNode)
    some_words = OneOrMore(word).setParseAction(' '.join, LiteralNode)
    phrase_item = some_words | uptowords_expr | uptochars_expr

    phrase_expr = infixNotation(phrase_item,
                                [
                                ((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT, BeforeAfterJoinNode), # was not working earlier, because BEFORE etc. were not keywords, and hence parsed as words
                                (None, 2, opAssoc.LEFT, ConsecutivePhrases),
                                (AND, 2, opAssoc.LEFT, AndNode),
                                (OR, 2, opAssoc.LEFT, OrNode),
                                ],
                                lpar=Suppress('{'), rpar=Suppress('}')
                                ) # structure of a single phrase with its operators

    line_term = Group((LINE_CONTAINS|PARA_STARTSWITH)("line_directive") +
                      (phrase_expr)("phrases")) # basically giving structure to a single sub-rule having line-term and phrase
    #
    line_contents_expr = infixNotation(line_term.setParseAction(LineTermNode),
                                       [(AND, 2, opAssoc.LEFT, LineAndNode),
                                        (OR, 2, opAssoc.LEFT, LineOrNode),
                                        ]
                                       ) # grammar for the entire rule/sentence
######################################
    mrrlist=[]
    for t in hrrsents:
        t = t.strip()
        if not t:
            continue
        try:
            parsed = line_contents_expr.parseString(t)
        except ParseException as pe:
            print(' '*pe.loc + '^')
            print(pe)
            continue


        temp_regex = parsed[0].generate()
        final_regexes3 = re.sub(r'gene','%s',temp_regex) # this can be made more precise by putting a condition of [non-word/^/$] around the 'gene'
        mrrlist.append(final_regexes3)
    return(mrrlist)
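
The AndNode class above turns each AND operand into a lookahead of the form `(?=.*\bphrase\b)`, so that all phrases must appear but in any order. A minimal standalone sketch of that idea, using only the `re` module (the helper name `and_regex` and the sample strings are made up here):

```python
import re


def and_regex(phrases):
    """Build a regex that requires every phrase to occur, in any order."""
    # Each lookahead asserts that the phrase appears somewhere ahead of the
    # current position without consuming any text, so they can be stacked.
    lookaheads = "".join(r"(?=.*\b" + re.escape(p) + r"\b)" for p in phrases)
    return "(" + lookaheads + ")"


rx = re.compile(and_regex(["gene", "protein"]))
results = [
    bool(rx.search("the protein encoded by this gene")),  # both words present
    bool(rx.search("the gene was cloned")),               # "protein" missing
]
print(results)
```

Because lookaheads are zero-width, the combined pattern matches an empty string at the first position from which all phrases are reachable, which is enough for a yes/no test on a line.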

Recommended answer

You have a two-tiered grammar here, so you would do best to focus on one tier at a time, which we have covered in some of your other questions. The lower tier is that of the phrase_expr, which will later serve as the argument to the line_directive_expr. So define examples of phrase expressions first - extract them from your list of complete statement samples. Your finished BNF for phrase_expr will have the lowest level of recursion look like:

phrase_atom ::= <one or more types of terminal items, like words of characters 
                 or quoted strings, or *possibly* expressions of numbers of 
                 words or characters>  |  brace + phrase_expr + brace

(Some other questions: Is it possible to have multiple phrase_items one after another with no operator? What does that indicate? How should it be parsed and interpreted? Should this implied operation be its own level of precedence?)

That will be sufficient to loop back the recursion for your phrase expression - you should not need any other braced_xxx element in your BNF. AND, OR, and JOIN are clearly binary operators - in normal operation precedence, AND's are evaluated before OR's, you can decide for yourself where JOIN should fall in this. Write some sample phrases with no parentheses, with AND and JOIN, and OR and JOIN, and think through what order of evaluation makes sense in your domain.
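
To make the recursion concrete, here is a plain-Python recursive-descent sketch of such a phrase grammar. It is not pyparsing, only an illustration of the BNF shape. It assumes JOIN binds tighter than AND, and AND tighter than OR (the JOIN placement is a guess that you would decide for your own domain), and it leaves out the implicit "adjacent phrases" operation. All names (`PhraseParser`, `tokenize`, and so on) are hypothetical.

```python
OPERATORS = {"AND", "OR", "JOIN"}
STOP_TOKENS = OPERATORS | {"{", "}"}


def tokenize(text):
    # Put spaces around braces so they split off as their own tokens.
    return text.replace("{", " { ").replace("}", " } ").split()


class PhraseParser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse(self):
        tree = self.phrase_expr()
        if self.peek() is not None:
            raise SyntaxError("unexpected token %r" % self.peek())
        return tree

    def binary(self, operator, operand):
        # operand (operator operand)*, folded left-associatively
        left = operand()
        while self.peek() == operator:
            self.take()
            left = (operator, left, operand())
        return left

    def phrase_expr(self):   # lowest precedence: OR
        return self.binary("OR", self.and_expr)

    def and_expr(self):
        return self.binary("AND", self.join_expr)

    def join_expr(self):     # highest precedence (assumed): JOIN
        return self.binary("JOIN", self.phrase_atom)

    def phrase_atom(self):
        # phrase_atom ::= word+ | '{' phrase_expr '}'
        if self.peek() == "{":
            self.take()
            inner = self.phrase_expr()   # the only place recursion is needed
            if self.take() != "}":
                raise SyntaxError("expected '}'")
            return inner
        words = []
        while self.peek() is not None and self.peek() not in STOP_TOKENS:
            words.append(self.take())
        if not words:
            raise SyntaxError("expected a word or '{'")
        return " ".join(words)


tree_flat = PhraseParser("alpha AND beta OR gamma JOIN delta").parse()
tree_braced = PhraseParser("alpha AND {beta OR gamma}").parse()
print(tree_flat)
print(tree_braced)
```

The braced case is handled entirely inside `phrase_atom`, which is why no separate braced_OR_phrase or braced_AND_phrase rule is needed.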

Once that is done, then line_directive_expr should be simple, since it is just:

line_directive_item ::= line_directive phrase_expr | brace line_directive_expr brace
line_directive_and ::= line_directive_item (AND line_directive_item)*
line_directive_or ::= line_directive_and (OR line_directive_and)*
line_directive_expr ::= line_directive_or
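
The second tier can be sketched the same way in plain Python. This is an illustration only, not pyparsing: the phrase is kept as raw text rather than parsed, and an AND or OR is treated as a line-level operator only when a line directive (possibly behind opening braces) follows it. That lookahead rule and all function names are assumptions made for this sketch.

```python
DIRECTIVES = {"LINE_CONTAINS", "LINE_STARTSWITH"}


def parse_line_directive_expr(text):
    tokens = text.replace("{", " { ").replace("}", " } ").split()
    pos = 0

    def peek(offset=0):
        i = pos + offset
        return tokens[i] if i < len(tokens) else None

    def starts_item(offset):
        # True if a line_directive_item begins at this lookahead offset.
        while peek(offset) == "{":
            offset += 1
        return peek(offset) in DIRECTIVES

    def item():
        # line_directive_item ::= line_directive phrase_expr
        #                       | brace line_directive_expr brace
        nonlocal pos
        if peek() == "{":
            pos += 1
            inner = or_level()
            if peek() != "}":
                raise SyntaxError("expected '}'")
            pos += 1
            return inner
        if peek() not in DIRECTIVES:
            raise SyntaxError("expected a line directive, got %r" % peek())
        directive = peek()
        pos += 1
        phrase, depth = [], 0
        while peek() is not None:
            tok = peek()
            # Stop at a line-level '}' or at a line-level AND/OR.
            if depth == 0 and (tok == "}" or (tok in ("AND", "OR") and starts_item(1))):
                break
            depth += (tok == "{") - (tok == "}")  # track braces inside the phrase
            phrase.append(tok)
            pos += 1
        return (directive, " ".join(phrase))

    def chain(operator, operand):
        # operand (operator operand)*, folded left-associatively
        nonlocal pos
        left = operand()
        while peek() == operator and starts_item(1):
            pos += 1
            left = (operator, left, operand())
        return left

    def and_level():
        return chain("AND", item)

    def or_level():
        return chain("OR", and_level)

    tree = or_level()
    if pos != len(tokens):
        raise SyntaxError("unexpected token %r" % peek())
    return tree


combined = parse_line_directive_expr(
    "LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH However we"
)
mixed = parse_line_directive_expr(
    "LINE_CONTAINS a OR LINE_CONTAINS b AND LINE_STARTSWITH c"
)
print(combined)
print(mixed)
```

Note that the AND inside the braces stays part of the phrase text, while the AND before LINE_STARTSWITH joins two line-level items.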

Then when you translate to pyparsing, add Groups and results names a little at a time! Don't immediately Group everything or name everything. Ordinarily I recommend using results names liberally, but in infix notation grammars, lots of results names can just clutter up the results. Let the Group (and ultimately node classes) do the structuring, and the behavior in the node classes will guide you where you want results names. For that matter, the results classes usually get such a simple structure that it is often easier just to do list unpacking in the class init or evaluate methods. Start with simple expressions and work up to complicated ones. (Look at "LINE_STARTSWITH gene" - it is one of your simplest test cases, but you have it as #97?) If you just sort this list by length order, that would be a good rough cut. Or sort by increasing number of operators. But tackling the complex cases before you have the simple ones working, you will have too many options on where a tweak or refinement should go, and (speaking from personal experience) you are as likely to get it wrong as get it right - except when you get it wrong, it just makes fixing the next issue more difficult.
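
Ordering the test sentences from simple to complex takes only a few lines. This is a rough sketch: the operator list and the helper name `operator_count` are assumptions, and the three sentences are the examples quoted in this thread.

```python
OPERATORS = {"AND", "OR", "JOIN", "BEFORE", "AFTER", "UPTO"}


def operator_count(sentence):
    # Count operator keywords, ignoring braces.
    tokens = sentence.replace("{", " ").replace("}", " ").split()
    return sum(tok in OPERATORS for tok in tokens)


sentences = [
    "LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH However we",
    "LINE_STARTSWITH gene",
    "LINE_CONTAINS the objective of this study was {to identify OR identifying} genes upregulated",
]

# Fewest operators first; sentence length breaks ties.
by_complexity = sorted(sentences, key=lambda s: (operator_count(s), len(s)))
for s in by_complexity:
    print(operator_count(s), s)
```

Working through the list in this order means each new failure is more likely to come from the one construct that was just introduced.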

And again, as we have discussed elsewhere, the devil in this second tier is doing the actual interpretation of the various line directive items, since there is an implied order to evaluating LINE_STARTSWITH vs LINE_CONTAINS that overrides the order that they may be found in the initial string. That ball is entirely in your court, since you are the language designer for this particular domain.
