有关如何解析自定义文件格式的提示 [英] Tips on how to parse custom file format

查看:118
本文介绍了有关如何解析自定义文件格式的提示的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

对不起,标题模糊,但是我真的不知道如何简单地描述这个问题.

Sorry about the vague title, but I really don't know how to describe this problem concisely.

我已经创建了(或多或少)简单的域特定语言我将使用它来指定适用于不同实体(通常是从网页提交的表单)的验证规则.我在这篇文章的底部提供了一个示例,说明了该语言的外观.

I've created a (more or less) simple domain-specific language that I will to use to specify what validation rules to apply to different entities (generally forms submitted from a web page). I've included a sample at the bottom of this post of what the language looks like.

我的问题是我不知道如何开始将这种语言解析为可以使用的形式(我将使用Python进行解析).我的目标是最终得到应按顺序应用于每个对象/实体的规则/过滤器(作为字符串,包括参数,例如'cocoa(99)')(还包括字符串,例如'chocolate'等).

My problem is that I have no idea how to begin parsing this language into a form I can use (I will be using Python to do the parsing). My goal is to end up with a list of rules/filters (as strings, including arguments, e.g. 'cocoa(99)') that should be applied (in order) to each object/entity (also a string, e.g. 'chocolate', 'chocolate.lindt', etc.).

我不确定要使用哪种技术开始,甚至不确定存在什么技术来解决此类问题.您认为解决此问题的最佳方法是什么?我不是在寻找一个完整的解决方案,只是在正确的方向上轻轻推一下.

I'm not sure what technique to use to start with, or even what techniques exist for problems like this. What do you think is the best way of going about this? I'm not looking for a complete solution, just a general nudge in the right direction.

谢谢.

语言样本文件:

# Comments start with the '#' character and last until the end of the line
# Indentation is significant (as in Python)


constant NINETY_NINE = 99       # Defines the constant `NINETY_NINE` to have the value `99`


*:      # Applies to all data
    isYummy             # Everything must be yummy

chocolate:              # To validate, say `validate("chocolate", object)`
    sweet               # chocolate must be sweet (but not necessarily chocolate.*)

    lindt:              # To validate, say `validate("chocolate.lindt", object)`
        tasty           # Applies only to chocolate.lindt (and not to chocolate.lindt.dark, for e.g.)

        *:              # Applies to all data under chocolate.lindt
            smooth      # Could also be written smooth()
            creamy(1)   # Level 1 creamy
        dark:           # dark has no special validation rules
            extraDark:
                melt            # Filter that modifies the object being examined
                c:bitter        # Must be bitter, but only validated on client
                s:cocoa(NINETY_NINE)    # Must contain 99% cocoa, but only validated on server. Note constant
        milk:
            creamy(2)   # Level 2 creamy, overrides creamy(1) of chocolate.lindt.* for chocolate.lindt.milk
            creamy(3)   # Overrides creamy(2) of previous line (all but the last specification of a given rule are ignored)



ruleset food:       # To define a chunk of validation rules that can be expanded from the placeholder `food` (think macro)
    caloriesWithin(10, 2000)        # Unlimited parameters allowed
    edible
    leftovers:      # Nested rules allowed in rulesets
        stale

# Rulesets may be nested and/or include other rulesets in their definition



chocolate:              # Previously defined groups can be re-opened and expanded later
    ferrero:
        hasHazelnut



cake:
    tasty               # Same rule used for different data (see chocolate.lindt)
    isLie
    ruleset food        # Substitutes with rules defined for food; cake.leftovers must now be stale


pasta:
    ruleset food        # pasta.leftovers must also be stale




# Sample use (in JavaScript):

# var choc = {
#   lindt: {
#       cocoa: {
#           percent: 67,
#           mass:    '27g'
#       }
#   }
#   // Objects/groups that are ommitted (e.g. ferrro in this example) are not validated and raise no errors
#   // Objects that are not defined in the validation rules do not raise any errors (e.g. cocoa in this example)
# };
# validate('chocolate', choc);

# `validate` called isYummy(choc), sweet(choc), isYummy(choc.lindt), smooth(choc.lindt), creamy(choc.lindt, 1), and tasty(choc.lindt) in that order
# `validate` returned an array of any validation errors that were found

# Order of rule validation for objects:
# The current object is initially the object passed in to the validation function (second argument).
# The entry point in the rule group hierarchy is given by the first argument to the validation function.
# 1. First all rules that apply to all objects (defined using '*') are applied to the current object,
#    starting with the most global rules and ending with the most local ones.
# 2. Then all specific rules for the current object are applied.
# 3. Then a depth-first traversal of the current object is done, repeating steps 1 and 2 with each object found as the current object
# When two rules have equal priority, they are applied in the order they were defined in the file.



# No need to end on blank line

推荐答案

首先,如果您想学习解析,请编写自己的递归下降解析器.您定义的语言只需要少量的生产.我建议使用Python的tokenize库来避免将字节流转换为令牌流这一无聊的任务.

First off, if you want to learn about parsing, then write your own recursive descent parser. The language you've defined only requires a handful of productions. I suggest using Python's tokenize library to spare yourself the boring task of converting a stream of bytes into a stream of tokens.

有关实用的解析选项,请继续阅读...

For practical parsing options, read on...

一种快速而肮脏的解决方案是使用python本身:

A quick and dirty solution is to use python itself:

NINETY_NINE = 99       # Defines the constant `NINETY_NINE` to have the value `99`

rules = {
  '*': {     # Applies to all data
    'isYummy': {},      # Everything must be yummy

    'chocolate': {        # To validate, say `validate("chocolate", object)`
      'sweet': {},        # chocolate must be sweet (but not necessarily chocolate.*)

      'lindt': {          # To validate, say `validate("chocolate.lindt", object)`
        'tasty':{}        # Applies only to chocolate.lindt (and not to chocolate.lindt.dark, for e.g.)

        '*': {            # Applies to all data under chocolate.lindt
          'smooth': {}  # Could also be written smooth()
          'creamy': 1   # Level 1 creamy
        },
# ...
    }
  }
}

有几种方法可以解决这个问题,例如,这是使用类的一种更干净的方法(尽管有些不寻常):

There are several ways to pull off this trick, e.g., here's a cleaner (albeit somewhat unusual) approach using classes:

class _:
    class isYummy: pass

    class chocolate:
        class sweet: pass

        class lindt:
            class tasty: pass

            class _:
                class smooth: pass
                class creamy: level = 1
# ...

作为完整解析器的中间步骤,您可以使用包含电池"的Python解析器,该解析器解析Python语法并返回AST. AST非常深,具有许多(IMO)不必要的级别.您可以通过剔除只有一个子节点的任何节点,将其过滤为更简单的结构.使用这种方法,您可以执行以下操作:

As an intermediate step to a full parser, you can use the "batteries-included" Python parser, which parses Python syntax and returns an AST. The AST is very deep with lots of (IMO) unnecessary levels. You can filter these down to a much simpler structure by culling any nodes that have only one child. With this approach you can do something like this:

import parser, token, symbol, pprint

_map = dict(token.tok_name.items() + symbol.sym_name.items())

def clean_ast(ast):
    if not isinstance(ast, list):
        return ast
    elif len(ast) == 2: # Elide single-child nodes.
        return clean_ast(ast[1])
    else:
        return [_map[ast[0]]] + [clean_ast(a) for a in ast[1:]]

ast = parser.expr('''{

'*': {     # Applies to all data
  isYummy: _,    # Everything must be yummy

  chocolate: {        # To validate, say `validate("chocolate", object)`
    sweet: _,        # chocolate must be sweet (but not necessarily chocolate.*)

    lindt: {          # To validate, say `validate("chocolate.lindt", object)`
      tasty: _,        # Applies only to chocolate.lindt (and not to chocolate.lindt.dark, for e.g.)

      '*': {            # Applies to all data under chocolate.lindt
        smooth: _,  # Could also be written smooth()
        creamy: 1   # Level 1 creamy
      }
# ...
    }
  }
}

}''').tolist()
pprint.pprint(clean_ast(ast))

这种方法确实有其局限性.最终的AST仍然有点嘈杂,您定义的语言必须能够解释为有效的python代码.例如,您不能支持此...

This approach does have its limitations. The final AST is still a bit noisy, and the language you define has to be interpretable as valid python code. For instance, you couldn't support this...

*:
    isYummy

...因为此语法无法解析为python代码.但是,它的最大优点是您可以控制AST转​​换,因此无法注入任意Python代码.

...because this syntax doesn't parse as python code. Its big advantage, however, is that you control the AST conversion, so it is impossible to inject arbitrary Python code.

这篇关于有关如何解析自定义文件格式的提示的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆