将伪代数字符串解析为命令 [英] Parsing pseudo-algebraic string into command

查看:18
本文介绍了将伪代数字符串解析为命令的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含对象列表的字典

I have a dictionary containing a list of object as

objects = {'A1': obj_1,
    'A2': obj_2,
    }

然后我有一个字符串

cmd = '(1.3A1 + 2(A2 + 0.7A3)) or 2(A4 to A6)'

我想把它翻译成一个命令

I want to translate this to a command as

max( 1.3*objects['A1'] + 2*(objects['A2'] + 0.73*objects['A3']), 2*max(objects['A4'], objects['A5'], objects['A6']))

我的尝试

由于找不到更好的选择,我开始从头开始编写解析器.

My try

As I didn't find any better option, I started writing a parser from scratches.

个人注意事项:我不认为将 150 行代码附加到 SO 问题是好的做法,因为这意味着读者应该阅读并理解它,这是一项艰巨的任务.尽管如此,我之前的问题被否决了,因为我没有提出我的解决方案.所以你在这里...

PERSONAL NOTE: I don't think that attaching a 150-line code to a SO question is good practice as this will imply that the reader should read and understand it which is a demanding task. Nonetheless my previous question was downvoted because I didn't put my solution. So here you are...

import re
from more_itertools import stagger

def comb_to_py(string, objects):

    # Split the line
    toks = split_comb_string(string)

    # Escape for empty string
    if toks[0] == 'none':
        return []

    # initialize iterator
    # I could use a deque here. Let's see what works the best
    iterator = stagger(toks, offsets=range(2), longest=True)

    return comb_it_to_py(iterator, objects)


def split_comb_string(string):

    # Add whitespaces between tokes when they could be implicit to allow string
    # splitting i.e. before/after plus (+), minus and closed bracket
    string = re.sub(r' ?([\+\-)]) ?', r' \1 ', string)

    # remove double spaces
    string = re.sub(' +', ' ', string)

    # Avoid situations as 'A1 + - 2A2' and replace them with 'A1 - 2A2'
    string = re.sub(r'\+ *\-', r'-', string)
    # Avoid situations as 'A1 - - 2A2' and replace them with 'A1 + 2A2'
    string = re.sub(r'\- *\-', r'+', string)

    # Add whitespace after "(" (we do not want to add it in front of it)
    string = re.sub(r'\( ?', r'( ', string)

    return string.strip().split(' ')


def comb_it_to_py(iterator, objects):

    for items in iterator:

        # item[0] is a case token (e.g. 1.2A3)
        # This should occur only with the first element
        if re.fullmatch(r'([\d.]*)([a-zA-Z(]+\d*)', items[0]) is not None:
            res = parse_case(items[0], objects, iterator)


        elif items[0] == ')' or items[0] is None:
            return res


        # plus (+)
        elif items[0] == '+':
            # skip one position
            skip_next(iterator)

            # add following item
            res += parse_case(items[1], objects, iterator)


        # minus (-)
        elif items[0] == '-':
            # skip one position
            skip_next(iterator)

            # add following item
            res -= parse_case(items[1], objects, iterator)

        else:
            raise(ValueError(f'Invalid or misplaced token {items[0]}'))

    return res

def parse_case(tok, objects, iterator):
    # Translate a case string into an object.
    # It handles also brackets as "cases" calling comb_it_to_py recursively
    res = re.match(r'([\d.]*)(\S*)', tok)

    if res[1] == '':
        mult = 1
    else:
        mult = float(res[1])

    if res[2] == '(':
        return mult * comb_it_to_py(iterator, objects)
    else:
        return mult * objects[res[2]]


def skip_next(iterator):
    try:
        next(iterator)
    except StopIteration:
        pass


if __name__ == '__main__':

    from numpy import isclose
    def test(string, expected_result):
        try:
            res = comb_to_py(string, objects)
        except Exception as e:
            print(f"Error during test on '{string}'")
            raise e

        assert isclose(res.value, expected_result), f"Failed test on '{string}'"


    objects = {'A1': 1, 'A2':2, 'A10':3}

    test('A2', 2)
    test('1.3A2', 2.6)

    test('1.3A2 + 3A1', 5.6)
    test('1.3A2+ 3A1', 5.6)
    test('1.3A2 +3A1', 5.6)
    test('1.3A2+3A1', 5.6)

    test('1.3A2 - 3A1', -0.4)
    test('1.3A2 -3A1', -0.4)
    test('1.3A2- 3A1', -0.4)
    test('1.3A2-3A1', -0.4)

    test('1.3A2 + -3A1', -0.4)
    test('1.3A2 +-3A1', -0.4)
    test('1.3A2 - -3A1', 5.6)

    test('A1 + 2(A2+A10)', 25)
    test('A1 - 2(A2+A10)', -23)

    test('2(A2+A10) + A1', 25)
    test('2(A2+A10) - A1', 23)
    test('2(A2+A10) - -A1', 25)
    test('2(A2+A10) - -2A1', 26)

这段代码不仅冗长,而且非常容易破解.整个代码基于字符串的正确拆分,正则表达式部分只是为了确保字符串被正确拆分,这完全取决于字符串内空格的位置,即使 - 在这个特定的语法中 -大多数空格根本不应该被解析.

This code is not only lengthy, but also very easy to break. The whole code is based on the correct split of the string and the regex section is there only to be sure that the string is split correctly, which totally depend on the position of the whitespaces inside the string, even if - in this specific syntax - most whitespaces should not be parsed at all.

此外,这段代码仍然没有处理 or 关键字(其中 A 或 B 应该转换为 max(A,B)to 关键字(其中 A1 到 A9 应该在 max([Ai for Ai in range(A1, A9)]) 中转换).

Moreover, this code still doesn't handle the or keyword (where A or B should translate into max(A,B) and the to keyword (where A1 to A9 should translate in max([Ai for Ai in range(A1, A9)])).

对于此类任务,这是最好的方法还是有更强大的方法?

Is this the best approach or is there a more robust way for this type of tasks?

我查看了 pyparsing.它看起来是一种可能性,但是,如果我理解得很好,它应该被用作更强大的行拆分",而令牌仍然必须手动一个一个地转换为一个操作.这是正确的吗?

I gave a look to pyparsing. It looks as a possibility, but, if I understood well, it should be used as a more robust "line-splitting", while the tokens would still have to be translated to an operation one by one manually. Is this correct?

推荐答案

正则表达式本质上不适合涉及嵌套分组的括号的任务 - 您的伪代数语言 (PAL) 不是 常规语言.一个实际的解析器,例如 PyParsing (a PEG 解析器) 应改为使用.

Regular expressions are inherently unsuitable for a task involving parentheses for nested grouping – your pseudo-algebraic language (PAL) is not a regular language. An actual parser such as PyParsing (a PEG parser) should be used instead.

虽然这仍然需要从源代码转换为操作,但这可以在解析过程中直接执行.

While this still requires translating from source code to operations, this can be performed directly during parsing.

我们需要一些直接转换为 Python 原语的语言元素:

We need a few language elements that directly translate to Python primitives:

  • 数字文字,例如 1.3,如 int/float 文字或 fractions.Fraction.
  • 名称引用,例如 A3,作为 objects 命名空间的键.
  • 括号,例如 (...),通过括号分组:
    • 变体,例如 (1.3 or A3),作为 max 调用.
    • 名称范围,例如 A4 到 A6,作为 max 调用
    • + 二元运算符,作为 + 二元运算符.
    • Number literals, such as 1.3, as int/float literals or fractions.Fraction.
    • Name references, such as A3, as keys to the objects namespace.
    • Parentheses, such as (...), as grouping via parentheses for:
      • Variants, such as (1.3 or A3), as max calls.
      • Name ranges, such as A4 to A6, as max calls
      • The + binary operator, as the + binary operator.

      这种简单的语言同样适用于转译器或解释器——没有副作用或内省,因此没有一流对象、中间表示或 AST 的简单翻译就可以了.

      Such a simple language is equally suitable for a transpiler or interpreter – there are no side-effects or introspection, so a naive translation without first-class objects, intermediate representation or AST is fine.

      对于转译器,我们需要将 PAL 源代码转换为 Python 源代码.我们可以使用pyparsing 直接读取 PAL 并使用解析动作发出 Python.

      For a transpiler, we need to transform from PAL source code to Python source code. We can use pyparsing to directly read PAL and use a parse action to emit Python.

      最简单的情况是数字——PAL 和 Python 源代码是相同的.这是查看转译的一般结构的理想选择:

      The simplest case are numbers – both PAL and Python source are identical. This is ideal to look at the general structure of transpiling:

      import pyparsing as pp
      
      # PAL grammar rule: one "word" of sign, digits, dot, digits
      NUMBER = pp.Regex(r"-?\d+\.?\d*")
      
      # PAL -> Python transformation: Compute appropriate Python code
      @NUMBER.setParseAction
      def translate(result: pp.ParseResults) -> str:
          return result[0]
      

      请注意,setParseAction 通常与 lambda 一起使用,而不是修饰 def.然而,较长的变体更容易评论/注释.

      Note that setParseAction is commonly used with a lambda, instead of decorating a def. The longer variant is easier to comment/annotate, however.

      名称引用类似于解析,但需要对 Python 进行一些小的转换.我们仍然可以使用正则表达式,因为这里也没有嵌套.所有名称都将成为我们任意称为 objects 的单个全局命名空间的键.

      A name reference is similar to parse, but needs some minor translation to Python. We can still use regular expressions, since there is no nesting here either. All names will be keys to a single, global namespace that we arbitrarily call objects.

      NAME = pp.Regex(r"\w+\d+")
      
      @NAME.setParseAction
      def translate(result: pp.ParseResults) -> str:
          return f'objects["{result[0]}"]'   # interpolate key into namespace
      

      两个语法部分已经独立工作以进行转译.例如,NAME.parseString("A3") 提供了源代码 objects["A3"].

      Both grammar parts work independently for transpiling already. For example, NAME.parseString("A3") provides the source code objects["A3"].

      与终端/原始语法表达式不同,复合表达式必须引用其他表达式,可能是它们本身(此时,正则表达式失败).PyParsing 使用 Forward 表达式使这变得简单——这些是稍后定义的占位符.

      Unlike terminal/primitive grammar expressions, compound expressions must refer to other expressions, possibly themselves (at this point, regular expressions fail). PyParsing makes this simple with Forward expressions – these are placeholders which are defined later on.

      # placeholder for any valid PAL grammar element
      EXPRESSION = pp.Forward()
      

      没有运算符优先级,仅通过(...)进行分组,所有+orto> 工作类似.我们选择 作为演示者.

      Without operator precedence and just grouping via (...), all of +, or and to work similar. We pick or as a demonstrator.

      现在语法变得更加复杂:我们使用 pp.Suppress 来匹配但丢弃纯语法的 (/).我们使用 +/- 来组合几个语法表达式(- 意味着解析时没有替代方案).最后,我们使用前向引用 EXPRESSION 来引用所有其他的和这个表达式.

      The grammar gets more complicated now: We use pp.Suppress to match but discard the purely syntactical (/) and or. We use +/- to combine several grammar expressions (- means there are no alternatives when parsing). Finally, we use the forward reference EXPRESSION to refer to every other and this expression.

      SOME_OR = pp.Suppress("(") + EXPRESSION + pp.OneOrMore(pp.Suppress("or") - EXPRESSION) - pp.Suppress(")")
      
      @SOME_OR.setParseAction
      def translate(result: pp.ParseResults) -> str:
          elements = ', '.join(result)
          return f"max({elements})"
      

      名称范围和添加的工作原理基本相同,只是分隔符和输出格式发生了变化.隐式乘法更简单,因为它只适用于一对表达式.

      Name ranges and addition work fundamentally the same, only the delimiter and output formatting changes. Implicit multiplication is simpler in that it works only on a pair of expressions.

      此时,我们为每种种类的语言元素准备了一个转译器.可以使用相同的方法创建缺失的规则.现在,我们需要实际阅读源代码并运行转译后的代码.

      At this point, we have a transpiler for each kind of language element. The missing rules can be created with the same approach. Now, we need to actually read source code and run the transpiled code.

      我们首先将我们拥有的部分放在一起:将所有语法元素插入到前向引用中.我们还提供了一个方便的函数来抽象出 PyParsing.

      We start by putting together the pieces that we have: inserting all grammar elements into the forward reference. We also provide a convenience function to abstract away PyParsing.

      EXPRESSION << (NAME | NUMBER | SOME_OR)
      
      def transpile(pal: str) -> str:
          """Transpile PAL source code to Python source code"""
          return EXPRESSION.parseString(pal, parseAll=True)[0]
      

      为了运行一些代码,我们需要转译 PAL 代码评估具有一些命名空间的 Python 代码.由于我们的语法只允许安全输入,我们可以直接使用eval:

      In order to run some code, we need to transpile the PAL code and evaluate the Python code with some namespace. Since our grammar only allows safe input, we can use eval directly:

      def execute(pal, **objects):
          """Execute PAL source code given some object values"""
          code = transpile(pal)
          return eval(code, {"objects": objects})
      

      可以使用给定的 PAL 源和名称值运行此函数以计算等效的 Python 值:

      This function can be run with a given PAL source and name values to evaluate the equivalent Python value:

      >>> execute("(A4 or A3 or 13)", A3=42, A4=7)
      42
      

      为了完全支持 PAL,定义缺少的复合规则并将它们与其他规则一起添加到 EXPRESSION.

      For complete support of PAL, define the missing compound rules and add them alongside the others to EXPRESSION.

      这篇关于将伪代数字符串解析为命令的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆