Parsing nested structures with pyparsing


Question

I'm trying to parse a particular syntax for positions in biological sequences. The positions can have forms like:

12           -- a simple position in the sequence
12+34        -- a complex position as a base (12) and offset(+34)
12_56        -- a range, from 12 to 56
12+34_56-78  -- a range as a start to end, where either or both may be simple or complex

I'd like to have these parsed as dicts, roughly like this:

12          -> { 'start': { 'base': 12, 'offset': 0 },  'end': None }
12+34       -> { 'start': { 'base': 12, 'offset': 34 }, 'end': None }
12_56       -> { 'start': { 'base': 12, 'offset': 0 },
                   'end': { 'base': 56, 'offset': 0 } }
12+34_56-78 -> { 'start': { 'base': 12, 'offset': 34 },
                   'end': { 'base': 56, 'offset': -78 } }

I've made several stabs using pyparsing. Here's one:

from pyparsing import *
integer = Word(nums)
signed_integer = Word('+-', nums)
underscore = Suppress('_')
position = integer.setResultsName('base') + Or(signed_integer,Empty).setResultsName('offset')
interval = position.setResultsName('start') + Or(underscore + position,Empty).setResultsName('end')

The result is close to what I want:

In [20]: hgvspyparsing.interval.parseString('12-34_56+78').asDict()
Out[20]: 
{'base': '56',
'end': (['56', '+78'], {'base': [('56', 0)], 'offset': [((['+78'], {}), 1)]}),
'offset': (['+78'], {}),
'start': (['12', '-34'], {'base': [('12', 0)], 'offset': [((['-34'], {}), 1)]})}

Two questions:

  1. asDict() only worked on the root parseResult. Is there a way to cajole pyparsing into returning a nested dict (and only that)?

  2. How do I get the optionality of the end of a range and the offset of a position? The Or() in the position rule doesn't cut it. (I tried similarly for the end of the range.) Ideally, I'd treat all positions as special cases of the most complex form (i.e., { start: { base, offset }, end: { base, offset } }), where the simpler cases use 0 or None.

Thanks!

Recommended answer

Some general pyparsing tips:

Or(expr, empty) is better written as Optional(expr). Also, your Or expression was trying to create an Or with the class Empty; you probably meant to write Empty() or empty for the second argument.

expr.setResultsName("name") can now be written as expr("name").

If you want to apply structure to your results, use Group.

Use dump() instead of asDict() to better view the structure of your parsed results.
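To make the Group tip concrete, here is a small sketch that parses the same input with and without Group. The names flat_position and flat_interval are made up for this illustration and are not from the question. Without Group, both positions write "base" and "offset" into one flat namespace, so the later values shadow the earlier ones, which is why the asker's output showed a top-level 'base': '56'.

```python
from pyparsing import Word, nums, oneOf, Combine, Group, Optional

integer = Word(nums)
signed_integer = Combine(oneOf("+ -") + integer)

# Ungrouped: the names of both positions land in the same flat result.
flat_position = integer("base") + Optional(signed_integer, default="0")("offset")
flat_interval = flat_position + Optional("_" + flat_position)

# Grouped: each position keeps its own "base" and "offset".
position = Group(integer("base") + Optional(signed_integer, default="0")("offset"))
interval = position("start") + Optional("_" + position("end"))

flat = flat_interval.parseString("12+34_56-78")
nested = interval.parseString("12+34_56-78")

# dump() shows the token list followed by the named results, indented by level.
print(flat.dump())
print(nested.dump())

# In the flat result the last match wins; in the grouped one both survive.
print(flat["base"], nested.start.base, nested.end.base)
```

The Optional(..., default="0") calls also answer the second question: a missing offset is filled in with "0", and a missing end is simply absent from the result.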

Here is how I would build up your expression:

from pyparsing import Word, nums, oneOf, Combine, Group, Optional

integer = Word(nums)

sign = oneOf("+ -")
signedInteger = Combine(sign + integer)

integerExpr = Group(integer("base") + Optional(signedInteger, default="0")("offset"))

integerRange = integerExpr("start") + Optional('_' + integerExpr("end"))


tests = """\
12
12+34
12_56
12+34_56-78""".splitlines()

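for t in tests:
    result = integerRange.parseString(t)
    print(t)
    # dump() lists the tokens followed by the nested named results
    print(result.dump())
    print(result.asDict())
    print(result.start.base, result.start.offset)
    # "end" is only present when the input contains a range
    if result.end:
        print(result.end.base, result.end.offset)
    print()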

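Prints (the asDict() lines below show the nested ParseResults repr produced by older pyparsing releases; newer releases should print plain nested dicts there):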

12
[['12', '0']]
- start: ['12', '0']
  - base: 12
  - offset: 0
{'start': (['12', '0'], {'base': [('12', 0)], 'offset': [('0', 1)]})}
12 0

12+34
[['12', '+34']]
- start: ['12', '+34']
  - base: 12
  - offset: +34
{'start': (['12', '+34'], {'base': [('12', 0)], 'offset': [('+34', 1)]})}
12 +34

12_56
[['12', '0'], '_', ['56', '0']]
- end: ['56', '0']
  - base: 56
  - offset: 0
- start: ['12', '0']
  - base: 12
  - offset: 0
{'start': (['12', '0'], {'base': [('12', 0)], 'offset': [('0', 1)]}), 'end': (['56', '0'], {'base': [('56', 0)], 'offset': [('0', 1)]})}
12 0
56 0

12+34_56-78
[['12', '+34'], '_', ['56', '-78']]
- end: ['56', '-78']
  - base: 56
  - offset: -78
- start: ['12', '+34']
  - base: 12
  - offset: +34
{'start': (['12', '+34'], {'base': [('12', 0)], 'offset': [('+34', 1)]}), 'end': (['56', '-78'], {'base': [('56', 0)], 'offset': [('-78', 1)]})}
12 +34
56 -78
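To get exactly the dict shape asked for in the question (integer values, and None for a missing end), one option is a small conversion step on top of the grouped grammar. This is only a sketch: position_to_dict and parse_interval are hypothetical helper names introduced here, not part of pyparsing, and the underscore is wrapped in Suppress so it does not appear in the token list.

```python
from pyparsing import Word, nums, oneOf, Combine, Group, Optional, Suppress

integer = Word(nums)
signed_integer = Combine(oneOf("+ -") + integer)
position = Group(integer("base") + Optional(signed_integer, default="0")("offset"))
interval = position("start") + Optional(Suppress("_") + position("end"))


def position_to_dict(pos):
    # Hypothetical helper: int() accepts the signed strings "+34" and "-78".
    return {"base": int(pos.base), "offset": int(pos.offset)}


def parse_interval(text):
    # Hypothetical helper: parseAll=True rejects trailing, unparsed input.
    result = interval.parseString(text, parseAll=True)
    return {
        "start": position_to_dict(result.start),
        # A missing "end" comes back as an empty, falsy value.
        "end": position_to_dict(result.end) if result.end else None,
    }


for text in ["12", "12+34", "12_56", "12+34_56-78"]:
    print(text, "->", parse_interval(text))
```

Doing the conversion in plain Python keeps the grammar identical to the one above. Attaching parse actions that call int() to the integer expressions would be another way to get numeric values.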
