Access parsed elements using Pyparsing
Question
I have a bunch of sentences which I need to parse and convert to corresponding regex search code. Examples of my sentences -
LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we
- This means that in the line, "phrase one" comes somewhere before "phrase2" and "phrase3". Also, the line must start with "Therefore we"
LINE_CONTAINS abc {upto 4 words} xyz {upto 3 words} pqr
- This means I need to allow up to 4 words between the first 2 phrases and up to 3 words between the last 2 phrases
With help from Paul McGuire (here), the following grammar was written -
import pprint
from pyparsing import (CaselessKeyword, Word, alphanums, nums, MatchFirst, quotedString,
    infixNotation, Combine, opAssoc, Suppress, pyparsing_common, Group, OneOrMore, ZeroOrMore)
# LINE_ENDSWITH is referenced further down, so it has to be defined here as well
LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH = map(CaselessKeyword,
    """LINE_CONTAINS LINE_STARTSWITH LINE_ENDSWITH""".split())
NOT, AND, OR = map(CaselessKeyword, "NOT AND OR".split())
BEFORE, AFTER, JOIN = map(CaselessKeyword, "BEFORE AFTER JOIN".split())
lpar=Suppress('{')
rpar=Suppress('}')
keyword = MatchFirst([LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH, NOT, AND, OR,
BEFORE, AFTER, JOIN]) # declaring all keywords and assigning order for all further use
phrase_word = ~keyword + (Word(alphanums + '_'))
upto_N_words = Group(lpar + 'upto' + pyparsing_common.integer('numberofwords') + 'words' + rpar)
phrase_term = Group(OneOrMore(phrase_word) + ZeroOrMore((upto_N_words) + OneOrMore(phrase_word)))
phrase_expr = infixNotation(phrase_term,
[
((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT,), # (opExpr, numTerms, rightLeftAssoc, parseAction)
(NOT, 1, opAssoc.RIGHT,),
(AND, 2, opAssoc.LEFT,),
(OR, 2, opAssoc.LEFT),
],
lpar=Suppress('{'), rpar=Suppress('}')
) # structure of a single phrase with its operators
line_term = Group((LINE_CONTAINS | LINE_STARTSWITH | LINE_ENDSWITH)("line_directive") +
Group(phrase_expr)("phrase")) # basically giving structure to a single sub-rule having line-term and phrase
line_contents_expr = infixNotation(line_term,
[(NOT, 1, opAssoc.RIGHT,),
(AND, 2, opAssoc.LEFT,),
(OR, 2, opAssoc.LEFT),
]
) # grammar for the entire rule/sentence
sample1 = """
LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we
"""
sample2 = """
LINE_CONTAINS abcd {upto 4 words} xyzw {upto 3 words} pqrs BEFORE something else
"""
My question now is: how do I access the parsed elements in order to convert the sentences to my regex code? For this, I tried the following -
parsed = line_contents_expr.parseString(sample1)  # or sample2
print (parsed[0].asDict())
print (parsed)
pprint.pprint(parsed)
The result of the above code for sample1 is -
{}
[[['LINE_CONTAINS', [[['sentence', 'one'], 'BEFORE', [['sentence2'], 'AND', ['sentence3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]]
([([(['LINE_CONTAINS', ([([(['sentence', 'one'], {}), 'BEFORE', ([(['sentence2'], {}), 'AND', (['sentence3'], {})], {})], {})], {})], {'phrase': [(([([(['sentence', 'one'], {}), 'BEFORE', ([(['sentence2'], {}), 'AND', (['sentence3'], {})], {})], {})], {}), 1)], 'line_directive': [('LINE_CONTAINS', 0)]}), 'AND', (['LINE_STARTSWITH', ([(['Therefore', 'we'], {})], {})], {'phrase': [(([(['Therefore', 'we'], {})], {}), 1)], 'line_directive': [('LINE_STARTSWITH', 0)]})], {})], {})
The result of the above code for sample2 is -
{'phrase': [[['abcd', {'numberofwords': 4}, 'xyzw', {'numberofwords': 3}, 'pqrs'], 'BEFORE', ['something', 'else']]], 'line_directive': 'LINE_CONTAINS'}
[['LINE_CONTAINS', [[['abcd', ['upto', 4, 'words'], 'xyzw', ['upto', 3, 'words'], 'pqrs'], 'BEFORE', ['something', 'else']]]]]
([(['LINE_CONTAINS', ([([(['abcd', (['upto', 4, 'words'], {'numberofwords': [(4, 1)]}), 'xyzw', (['upto', 3, 'words'], {'numberofwords': [(3, 1)]}), 'pqrs'], {}), 'BEFORE', (['something', 'else'], {})], {})], {})], {'phrase': [(([([(['abcd', (['upto', 4, 'words'], {'numberofwords': [(4, 1)]}), 'xyzw', (['upto', 3, 'words'], {'numberofwords': [(3, 1)]}), 'pqrs'], {}), 'BEFORE', (['something', 'else'], {})], {})], {}), 1)], 'line_directive': [('LINE_CONTAINS', 0)]})], {})
My questions based on the above output are -
- Why does the pprint (pretty print) have more detailed parsing than normal print?
- Why does the asDict() method give no output for sample1 but does for sample2?
- Whenever I try to access the parsed elements using print (parsed.numberofwords) or parsed.line_directive or parsed.line_term, it gives me nothing. How can I access these elements in order to use them to build my regex codes?
Recommended answer
To answer your printing questions: 1) pprint is there to pretty-print a nested list of tokens, without showing any results names; it is essentially a wrapper around pprint.pprint(results.asList()). 2) asDict() is there to convert your parsed results to an actual Python dict, so it shows only the results names (with nesting if you have names within names).
To view the contents of your parsed output, you are best off using print(result.dump()). dump() shows both the nesting of the results and any named results along the way.
result = line_contents_expr.parseString(sample2)
print(result.dump())
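The reason parsed.line_directive comes back empty is visible in such a dump: the names are attached to a nested group, not to the top-level result. A minimal sketch of this, assuming a much smaller grammar than the one in the question (only LINE_CONTAINS and plain alphabetic words, chosen purely for illustration):

```python
from pyparsing import CaselessKeyword, Group, OneOrMore, Word, alphas

# Pared-down stand-in for the question's grammar (an assumption for illustration)
LINE_CONTAINS = CaselessKeyword("LINE_CONTAINS")
word = ~LINE_CONTAINS + Word(alphas)
line_term = Group(LINE_CONTAINS("line_directive") + Group(OneOrMore(word))("phrase"))

result = line_term.parseString("LINE_CONTAINS abc xyz")

# The results names live inside the Group, so the top level has none of them
top_level = result.line_directive        # empty string: no such name at this level
nested = result[0].line_directive        # the name is found one level down
phrase_words = result[0].phrase.asList()

print(repr(top_level))
print(nested)
print(phrase_words)
print(result[0].asDict())
```

Indexing into the result first, then using the name, is the same access pattern that works on the full grammar once you know from dump() at which level each name sits.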
I also recommend using expr.runTests to give you dump() output as well as any exceptions and exception locators. With your code, you could most easily do this using:
line_contents_expr.runTests([sample1, sample2])
But I also suggest you step back a second and think about just what this {upto n words} business is all about. Look at your samples and draw rectangles around the line terms, and then within the line terms draw circles around the phrase terms. (This would be a good exercise in leading up to writing for yourself a BNF description of this grammar, which I always recommend as a getting-your-head-around-the-problem step.) What if you treated the upto expressions as another operator? To see this, change phrase_term back to the way you had it:
phrase_term = Group(OneOrMore(phrase_word))
And then change your first precedence entry in defining a phrase expression to:
((BEFORE | AFTER | JOIN | upto_N_words), 2, opAssoc.LEFT,),
Or give some thought to maybe having the upto operator at a higher or lower precedence than BEFORE, AFTER, and JOIN, and adjust the precedence list accordingly.
With this change, I get this output from calling runTests on your samples:
LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we
[[['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]]
[0]:
[['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]
[0]:
['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]]
- line_directive: 'LINE_CONTAINS'
- phrase: [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]
[0]:
[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]
[0]:
['phrase', 'one']
[1]:
BEFORE
[2]:
[['phrase2'], 'AND', ['phrase3']]
[0]:
['phrase2']
[1]:
AND
[2]:
['phrase3']
[1]:
AND
[2]:
['LINE_STARTSWITH', [['Therefore', 'we']]]
- line_directive: 'LINE_STARTSWITH'
- phrase: [['Therefore', 'we']]
[0]:
['Therefore', 'we']
LINE_CONTAINS abcd {upto 4 words} xyzw {upto 3 words} pqrs BEFORE something else
[['LINE_CONTAINS', [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]]]
[0]:
['LINE_CONTAINS', [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]]
- line_directive: 'LINE_CONTAINS'
- phrase: [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]
[0]:
[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]
[0]:
['abcd']
[1]:
['upto', 4, 'words']
- numberofwords: 4
[2]:
['xyzw']
[3]:
['upto', 3, 'words']
- numberofwords: 3
[4]:
['pqrs']
[5]:
BEFORE
[6]:
['something', 'else']
You can iterate over these results and pick them apart, but you are rapidly reaching the point where you should look at building executable nodes from the different precedence levels - see the SimpleBool.py example on the pyparsing wiki for how to do this.
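As a rough sketch of what picking the results apart by hand might look like, the snippet below walks the nested results and collects every numberofwords value. It assumes a cut-down grammar with only BEFORE and the upto operator, and collect_word_limits is a hypothetical helper name, not part of pyparsing:

```python
from pyparsing import (CaselessKeyword, Group, OneOrMore, ParseResults, Suppress,
                       Word, alphas, infixNotation, opAssoc, pyparsing_common)

# Cut-down grammar: upto is treated as a binary operator alongside BEFORE
BEFORE = CaselessKeyword("BEFORE")
lpar, rpar = Suppress("{"), Suppress("}")
upto_N_words = Group(lpar + "upto" + pyparsing_common.integer("numberofwords")
                     + "words" + rpar)
phrase_word = ~BEFORE + Word(alphas)
phrase_term = Group(OneOrMore(phrase_word))
phrase_expr = infixNotation(phrase_term,
                            [((BEFORE | upto_N_words), 2, opAssoc.LEFT)])

def collect_word_limits(tokens):
    """Hypothetical helper: gather numberofwords from every upto operator."""
    limits = []
    for item in tokens:
        if isinstance(item, ParseResults):
            if "numberofwords" in item:
                # This group is a parsed {upto N words} operator
                limits.append(item.numberofwords)
            else:
                # Any other group may contain operators further down
                limits.extend(collect_word_limits(item))
    return limits

sample = "abcd {upto 4 words} xyzw {upto 3 words} pqrs BEFORE something else"
limits = collect_word_limits(phrase_expr.parseString(sample))
print(limits)
```

This works for extracting values, but turning each operator into a regex fragment this way means re-deriving the structure that the parser already knows, which is why attaching node classes to the precedence levels tends to scale better.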
Please review this pared-down version of a parser for phrase_expr, and how it creates Node instances that themselves generate the output. See how numberofwords is accessed on the operator in the UpToNode class. See how "xyz abc" gets interpreted as "xyz AND abc" with an implicit AND operator.
from pyparsing import *
import re
UPTO, WORDS, AND, OR = map(CaselessKeyword, "upto words and or".split())
keyword = UPTO | WORDS | AND | OR
LBRACE,RBRACE = map(Suppress, "{}")
integer = pyparsing_common.integer()
word = ~keyword + Word(alphas)
upto_expr = Group(LBRACE + UPTO + integer("numberofwords") + WORDS + RBRACE)
class Node(object):
    # base class: keeps the parsed tokens so subclasses can generate regex text
    def __init__(self, tokens):
        self.tokens = tokens
    def generate(self):
        pass

class LiteralNode(Node):
    def generate(self):
        return "(%s)" % re.escape(self.tokens[0])
    def __repr__(self):
        return repr(self.tokens[0])

class AndNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        # operands sit at the even positions, operators at the odd ones
        return '.*'.join(t.generate() for t in tokens[::2])
    def __repr__(self):
        return ' AND '.join(repr(t) for t in self.tokens[0].asList()[::2])

class OrNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '|'.join(t.generate() for t in tokens[::2])
    def __repr__(self):
        return ' OR '.join(repr(t) for t in self.tokens[0].asList()[::2])

class UpToNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        ret = tokens[0].generate()
        word_re = r"\s+\S+"
        space_re = r"\s+"
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += "((%s){0,%d}%s)" % (word_re, op.numberofwords, space_re) + operand.generate()
        return ret
    def __repr__(self):
        tokens = self.tokens[0]
        ret = repr(tokens[0])
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += " {0-%d WORDS} " % (op.numberofwords) + repr(operand)
        return ret

IMPLICIT_AND = Empty().setParseAction(replaceWith("AND"))
phrase_expr = infixNotation(word.setParseAction(LiteralNode),
[
(upto_expr, 2, opAssoc.LEFT, UpToNode),
(AND | IMPLICIT_AND, 2, opAssoc.LEFT, AndNode),
(OR, 2, opAssoc.LEFT, OrNode),
])
tests = """\
xyz
xyz abc
xyz {upto 4 words} def""".splitlines()
for t in tests:
    t = t.strip()
    if not t:
        continue
    print(t)
    try:
        parsed = phrase_expr.parseString(t)
    except ParseException as pe:
        print(' '*pe.loc + '^')
        print(pe)
        continue
    print(parsed)
    print(parsed[0].generate())
    print()
Prints:
xyz
['xyz']
(xyz)
xyz abc
['xyz' AND 'abc']
(xyz).*(abc)
xyz {upto 4 words} def
['xyz' {0-4 WORDS} 'def']
(xyz)((\s+\S+){0,4}\s+)(def)
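To check that the generated patterns behave as intended, they can be tried against a few lines with the standard re module. This is only a quick sanity check; the test strings are made up for illustration:

```python
import re

# Patterns copied from the generated output above
upto_pattern = re.compile(r"(xyz)((\s+\S+){0,4}\s+)(def)")
and_pattern = re.compile(r"(xyz).*(abc)")

within_limit = "xyz one two three four def"      # 4 words in between
over_limit = "xyz one two three four five def"   # 5 words in between

print(bool(upto_pattern.search(within_limit)))    # True
print(bool(upto_pattern.search(over_limit)))      # False
print(bool(upto_pattern.search("xyz def")))       # True (zero words in between)

# The '.*' join is order-sensitive: xyz has to come before abc
print(bool(and_pattern.search("xyz then abc")))   # True
print(bool(and_pattern.search("abc then xyz")))   # False
```

The last two lines show that the AND pattern as generated behaves more like BEFORE than an order-independent AND, which may or may not be what you want for your rules.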
Expand on this to support your LINE_xxx expressions.