Access parsed elements using Pyparsing


Question

I have a bunch of sentences that I need to parse and convert into corresponding regex search code. Examples of my sentences:

LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we

- This means that, within the line, phrase one comes somewhere before phrase2 and phrase3. The line must also start with Therefore we.

LINE_CONTAINS abc {upto 4 words} xyz {upto 3 words} pqr

- This means I need to allow up to 4 words between the first two phrases and up to 3 words between the last two phrases.

With help from Paul McGuire, the following grammar was written:

from pyparsing import (CaselessKeyword, Word, alphanums, nums, MatchFirst, quotedString, 
    infixNotation, Combine, opAssoc, Suppress, pyparsing_common, Group, OneOrMore, ZeroOrMore)

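# LINE_ENDSWITH is used further down, so it must be defined here as well
LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH = map(CaselessKeyword,
    """LINE_CONTAINS LINE_STARTSWITH LINE_ENDSWITH""".split())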

NOT, AND, OR = map(CaselessKeyword, "NOT AND OR".split())
BEFORE, AFTER, JOIN = map(CaselessKeyword, "BEFORE AFTER JOIN".split())

lpar=Suppress('{') 
rpar=Suppress('}')

keyword = MatchFirst([LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH, NOT, AND, OR, 
                      BEFORE, AFTER, JOIN]) # declaring all keywords and assigning order for all further use

phrase_word = ~keyword + (Word(alphanums + '_'))

upto_N_words = Group(lpar + 'upto' + pyparsing_common.integer('numberofwords') + 'words' + rpar)

phrase_term = Group(OneOrMore(phrase_word) + ZeroOrMore(upto_N_words + OneOrMore(phrase_word)))



phrase_expr = infixNotation(phrase_term,
                            [
                             ((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT,), # (opExpr, numTerms, rightLeftAssoc, parseAction)
                             (NOT, 1, opAssoc.RIGHT,),
                             (AND, 2, opAssoc.LEFT,),
                             (OR, 2, opAssoc.LEFT),
                            ],
                            lpar=Suppress('{'), rpar=Suppress('}')
                            ) # structure of a single phrase with its operators

line_term = Group((LINE_CONTAINS | LINE_STARTSWITH | LINE_ENDSWITH)("line_directive") + 
                  Group(phrase_expr)("phrase")) # basically giving structure to a single sub-rule having line-term and phrase
line_contents_expr = infixNotation(line_term,
                                   [(NOT, 1, opAssoc.RIGHT,),
                                    (AND, 2, opAssoc.LEFT,),
                                    (OR, 2, opAssoc.LEFT),
                                    ]
                                   ) # grammar for the entire rule/sentence

sample1 = """
LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we
"""
sample2 = """
LINE_CONTAINS abcd {upto 4 words} xyzw {upto 3 words} pqrs BEFORE something else
"""


My question now is: how do I access the parsed elements in order to convert the sentences to my regex code? For this, I tried the following:

import pprint

parsed = line_contents_expr.parseString(sample1)  # run again with sample2
print(parsed[0].asDict())
print(parsed)
pprint.pprint(parsed)

The results of the above code for sample1 are:

{}

[[['LINE_CONTAINS', [[['sentence', 'one'], 'BEFORE', [['sentence2'], 'AND', ['sentence3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]]

([([(['LINE_CONTAINS', ([([(['sentence', 'one'], {}), 'BEFORE', ([(['sentence2'], {}), 'AND', (['sentence3'], {})], {})], {})], {})], {'phrase': [(([([(['sentence', 'one'], {}), 'BEFORE', ([(['sentence2'], {}), 'AND', (['sentence3'], {})], {})], {})], {}), 1)], 'line_directive': [('LINE_CONTAINS', 0)]}), 'AND', (['LINE_STARTSWITH', ([(['Therefore', 'we'], {})], {})], {'phrase': [(([(['Therefore', 'we'], {})], {}), 1)], 'line_directive': [('LINE_STARTSWITH', 0)]})], {})], {})

The results of the above code for sample2 are:

{'phrase': [[['abcd', {'numberofwords': 4}, 'xyzw', {'numberofwords': 3}, 'pqrs'], 'BEFORE', ['something', 'else']]], 'line_directive': 'LINE_CONTAINS'}

[['LINE_CONTAINS', [[['abcd', ['upto', 4, 'words'], 'xyzw', ['upto', 3, 'words'], 'pqrs'], 'BEFORE', ['something', 'else']]]]]

([(['LINE_CONTAINS', ([([(['abcd', (['upto', 4, 'words'], {'numberofwords': [(4, 1)]}), 'xyzw', (['upto', 3, 'words'], {'numberofwords': [(3, 1)]}), 'pqrs'], {}), 'BEFORE', (['something', 'else'], {})], {})], {})], {'phrase': [(([([(['abcd', (['upto', 4, 'words'], {'numberofwords': [(4, 1)]}), 'xyzw', (['upto', 3, 'words'], {'numberofwords': [(3, 1)]}), 'pqrs'], {}), 'BEFORE', (['something', 'else'], {})], {})], {}), 1)], 'line_directive': [('LINE_CONTAINS', 0)]})], {})

My questions based on the above output are:

  1. Why does pprint (pretty print) show more detail than a normal print?
  2. Why does the asDict() method give no output for sample1 but does give output for sample2?
  3. Whenever I try to access the parsed elements using print(parsed.numberofwords), parsed.line_directive or parsed.line_term, I get nothing back. How can I access these elements in order to use them to build my regex code?

Recommended answer

To answer your printing questions: 1) pprint() on the parsed results is there to pretty-print a nested list of tokens, without showing any results names; it is essentially a wrapper around pprint.pprint(results.asList()). 2) asDict() converts your parsed results to an actual Python dict, so it shows only the results names (with nesting if you have names within names).
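The difference between the two views can be seen on a much smaller grammar. The following is only a sketch: the key=value grammar is a hypothetical stand-in, not part of the question's grammar. It also suggests why asDict() can come back empty: results names defined inside a Group belong to that group, not to the enclosing results.

```python
from pyparsing import Group, OneOrMore, Suppress, Word, alphas, nums

# Hypothetical mini-grammar: each "key=value" pair is wrapped in its own Group,
# so the names "key" and "value" are attached to the group, not the top level.
pair = Group(Word(alphas)("key") + Suppress("=") + Word(nums)("value"))
result = OneOrMore(pair).parseString("a=1 b=2")

print(result.asList())     # nested token lists only, no names
print(result.asDict())     # no names exist at the top level
print(result[0].asDict())  # names defined inside the first Group
```

In sample1, parsed[0] is most likely the group created by the AND operator, which carries no names of its own; the named line terms sit one level further down.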

To view the contents of your parsed output, you are best off using print(result.dump()). dump() shows both the nesting of the results and any named results along the way.

result = line_contents_expr.parseString(sample2)
print(result.dump())
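Once dump() shows at which nesting level a name lives, you can index down to that level and read the name there. A minimal sketch, assuming a cut-down phrase grammar (the names upto, word and phrase below are illustrative and simpler than the grammar in the question):

```python
from pyparsing import (Group, OneOrMore, ParseResults, Suppress, Word,
                       alphas, pyparsing_common)

# Cut-down grammar for illustration: words with optional {upto N words} gaps.
upto = Group(Suppress('{') + 'upto' + pyparsing_common.integer('numberofwords')
             + 'words' + Suppress('}'))
word = Word(alphas)
phrase = Group(OneOrMore(upto | word))

result = phrase.parseString("abcd {upto 4 words} xyzw {upto 3 words} pqrs")
print(result.dump())

# The name is not defined at the top level, so this prints an empty string.
print(repr(result.numberofwords))

# It is defined on each nested "upto" group, so walk the tokens to reach it.
tokens = result[0]
gaps = [tok.numberofwords for tok in tokens
        if isinstance(tok, ParseResults) and 'numberofwords' in tok]
print(gaps)
```

The same idea applies to line_directive and phrase in the full grammar: they are attached to each line_term group, so they have to be read from that group rather than from the top-level results.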

I also recommend using expr.runTests, which gives you the dump() output as well as any exceptions and exception locators. With your code, the easiest way to do this is:

line_contents_expr.runTests([sample1, sample2])

But I also suggest you step back for a second and think about what this {upto n words} business is really about. Look at your samples and draw rectangles around the line terms, and then within the line terms draw circles around the phrase terms. (This is a good exercise leading up to writing a BNF description of this grammar for yourself, which I always recommend as a getting-your-head-around-the-problem step.) What if you treated the upto expressions as another operator? To see this, change phrase_term back to the way you had it:

phrase_term = Group(OneOrMore(phrase_word))

Then change the first precedence entry in the definition of phrase_expr to:

    ((BEFORE | AFTER | JOIN | upto_N_words), 2, opAssoc.LEFT,),

Or consider giving the upto operator a higher or lower precedence than BEFORE, AFTER, and JOIN, and adjust the precedence list accordingly.

With this change, I get the following output from calling runTests on your samples:

LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we

[[['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]]
[0]:
  [['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]
  [0]:
    ['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]]
    - line_directive: 'LINE_CONTAINS'
    - phrase: [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]
      [0]:
        [['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]
        [0]:
          ['phrase', 'one']
        [1]:
          BEFORE
        [2]:
          [['phrase2'], 'AND', ['phrase3']]
          [0]:
            ['phrase2']
          [1]:
            AND
          [2]:
            ['phrase3']
  [1]:
    AND
  [2]:
    ['LINE_STARTSWITH', [['Therefore', 'we']]]
    - line_directive: 'LINE_STARTSWITH'
    - phrase: [['Therefore', 'we']]
      [0]:
        ['Therefore', 'we']



LINE_CONTAINS abcd {upto 4 words} xyzw {upto 3 words} pqrs BEFORE something else

[['LINE_CONTAINS', [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]]]
[0]:
  ['LINE_CONTAINS', [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]]
  - line_directive: 'LINE_CONTAINS'
  - phrase: [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]
    [0]:
      [['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]
      [0]:
        ['abcd']
      [1]:
        ['upto', 4, 'words']
        - numberofwords: 4
      [2]:
        ['xyzw']
      [3]:
        ['upto', 3, 'words']
        - numberofwords: 3
      [4]:
        ['pqrs']
      [5]:
        BEFORE
      [6]:
        ['something', 'else']

You can iterate over these results and pick them apart, but you are rapidly reaching the point where you should look at building executable nodes from the different precedence levels; see the SimpleBool.py example on the pyparsing wiki for how to do this.
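To give a feel for what picking the results apart by hand might involve, here is a rough sketch that walks the phrase list for sample2 as it appears in the runTests output above. The helper to_regex is hypothetical, and mapping BEFORE to ".*" is an assumption about the intended semantics, not something the grammar defines.

```python
import re

# Nested list copied from the 'phrase' entry of the sample2 output above.
phrase = [['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'],
          ['pqrs'], 'BEFORE', ['something', 'else']]

def to_regex(items):
    """Translate one flat operand/operator sequence into a regex string."""
    parts = []
    for item in items:
        if isinstance(item, str):
            # Operator keyword; only BEFORE is handled in this sketch
            # (assumed to mean "anything may appear in between").
            if item != 'BEFORE':
                raise ValueError("unsupported operator: %r" % item)
            parts.append('.*')
        elif len(item) == 3 and item[0] == 'upto' and isinstance(item[1], int):
            # ['upto', n, 'words'] -> allow at most n intervening words
            parts.append(r'(?:\s+\S+){0,%d}\s+' % item[1])
        else:
            # A phrase is a list of words separated by whitespace.
            parts.append(r'\s+'.join(re.escape(w) for w in item))
    return ''.join(parts)

pattern = to_regex(phrase)
print(pattern)
print(bool(re.search(pattern, "abcd one two xyzw pqrs and then something else")))
print(bool(re.search(pattern, "abcd a b c d e xyzw pqrs something else")))
```

This works for one flat sequence, but every new operator and nesting level adds another branch to the loop, which is the point at which node classes attached to the precedence levels become easier to maintain.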

Please review this pared-down version of a parser for phrase_expr, and how it creates Node instances that themselves generate the output. See how numberofwords is accessed on the operator in the UpToNode class, and how "xyz abc" gets interpreted as "xyz AND abc" through an implicit AND operator.

from pyparsing import *
import re

UPTO, WORDS, AND, OR = map(CaselessKeyword, "upto words and or".split())
keyword = UPTO | WORDS | AND | OR
LBRACE,RBRACE = map(Suppress, "{}")
integer = pyparsing_common.integer()

word = ~keyword + Word(alphas)
upto_expr = Group(LBRACE + UPTO + integer("numberofwords") + WORDS + RBRACE)

class Node(object):
    def __init__(self, tokens):
        self.tokens = tokens

    def generate(self):
        pass

class LiteralNode(Node):
    def generate(self):
        return "(%s)" % re.escape(self.tokens[0])
    def __repr__(self):
        return repr(self.tokens[0])

class AndNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '.*'.join(t.generate() for t in tokens[::2])

    def __repr__(self):
        return ' AND '.join(repr(t) for t in self.tokens[0].asList()[::2])

class OrNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '|'.join(t.generate() for t in tokens[::2])

    def __repr__(self):
        return ' OR '.join(repr(t) for t in self.tokens[0].asList()[::2])

class UpToNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        ret = tokens[0].generate()
        word_re = r"\s+\S+"
        space_re = r"\s+"
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += "((%s){0,%d}%s)" % (word_re, op.numberofwords, space_re) + operand.generate()
        return ret

    def __repr__(self):
        tokens = self.tokens[0]
        ret = repr(tokens[0])
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += " {0-%d WORDS} " % (op.numberofwords) + repr(operand)
        return ret

IMPLICIT_AND = Empty().setParseAction(replaceWith("AND"))

phrase_expr = infixNotation(word.setParseAction(LiteralNode),
        [
        (upto_expr, 2, opAssoc.LEFT, UpToNode),
        (AND | IMPLICIT_AND, 2, opAssoc.LEFT, AndNode),
        (OR, 2, opAssoc.LEFT, OrNode),
        ])

tests = """\
        xyz
        xyz abc
        xyz {upto 4 words} def""".splitlines()

for t in tests:
    t = t.strip()
    if not t:
        continue
    print(t)
    try:
        parsed = phrase_expr.parseString(t)
    except ParseException as pe:
        print(' '*pe.loc + '^')
        print(pe)
        continue
    print(parsed)
    print(parsed[0].generate())
    print()

Prints:

xyz
['xyz']
(xyz)

xyz abc
['xyz' AND 'abc']
(xyz).*(abc)

xyz {upto 4 words} def
['xyz' {0-4 WORDS} 'def']
(xyz)((\s+\S+){0,4}\s+)(def)
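A quick way to sanity-check the generated patterns is to run them against a few lines with the standard re module. The sample lines below are made up for illustration; the two patterns are copied from the output above.

```python
import re

# Patterns copied verbatim from the printed output above.
and_pattern = r"(xyz).*(abc)"
upto_pattern = r"(xyz)((\s+\S+){0,4}\s+)(def)"

def matches(pattern, line):
    """Return True if the pattern is found anywhere in the line."""
    return re.search(pattern, line) is not None

# Implicit AND: both words must appear, in this order.
print(matches(and_pattern, "xyz and later abc"))
print(matches(and_pattern, "abc and later xyz"))

# upto 4 words: at most four words may sit between xyz and def.
print(matches(upto_pattern, "xyz def"))
print(matches(upto_pattern, "xyz one two def"))
print(matches(upto_pattern, "xyz a b c d e def"))
```

Note that AndNode joins its operands with ".*", so the generated AND is order-sensitive; whether that is the intended meaning of AND is a design decision for the final grammar.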

Expand on this to support your LINE_xxx expressions.
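One possible direction for that extension, sketched under stated assumptions: the helper line_term_matcher is hypothetical, the anchoring rules for each directive are a guess at the intended semantics, and the phrase regexes are written by hand here rather than generated by the Node classes.

```python
import re

def line_term_matcher(directive, phrase_regex):
    """Return a predicate that tests one line against a single line term."""
    if directive == "LINE_STARTSWITH":
        pattern = r"^\s*(?:%s)" % phrase_regex   # assumed: leading blanks allowed
    elif directive == "LINE_ENDSWITH":
        pattern = r"(?:%s)\s*$" % phrase_regex   # assumed: trailing blanks allowed
    elif directive == "LINE_CONTAINS":
        pattern = phrase_regex                   # match anywhere in the line
    else:
        raise ValueError("unknown directive: %r" % directive)
    compiled = re.compile(pattern)
    return lambda line: compiled.search(line) is not None

# Hand-written stand-ins for the two line terms of sample1.
contains = line_term_matcher("LINE_CONTAINS", r"phrase\s+one.*phrase2.*phrase3")
starts = line_term_matcher("LINE_STARTSWITH", r"Therefore\s+we")

def rule(line):
    # The top-level AND between line terms combines the two predicates.
    return contains(line) and starts(line)

print(rule("Therefore we see phrase one, then phrase2 and phrase3"))
print(rule("We see phrase one, then phrase2 and phrase3"))
```

Combining predicates with Python's and, or and not avoids having to fold the line-level operators into one large regex; in a Node-based design, each line-level node could return such a predicate from its generate() method.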
