How to parse code (in Python)?
Question
I need to parse some special data structures. They are in a somewhat C-like format that looks roughly like this:
Group("GroupName") {
    /* C-Style comment */
    Group("AnotherGroupName") {
        Entry("some","variables",0,3.141);
        Entry("other","variables",1,2.718);
    }
    Entry("linebreaks",
          "allowed",
          3,
          1.414
    );
}
I can think of several ways to go about this. I could 'tokenize' the code using regular expressions. I could read the code one character at a time and use a state machine to construct my data structure. I could get rid of comma-linebreaks and read the thing line by line. I could write some conversion script that converts this code to executable Python code.
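To make the first of these options concrete, here is a minimal sketch of the "tokenize with regular expressions" idea, followed by a small recursive-descent parser over the resulting tokens. It uses only the standard library. The names `TOKEN_RE`, `tokenize` and `Parser` are invented for this illustration, and the grammar is an assumption read off the sample above (strings without escape sequences, only `Group` and `Entry` keywords), not a specification of the real format.

```python
import re

# Token patterns, tried in order; comments and whitespace are matched but dropped.
TOKEN_RE = re.compile(r"""
    (?P<COMMENT>/\*.*?\*/)      # C-style comment
  | (?P<WS>\s+)                 # whitespace, including line breaks
  | (?P<STRING>"[^"]*")         # double-quoted string, escapes not handled
  | (?P<REAL>[+-]?\d+\.\d*)     # real number such as 3.141
  | (?P<INT>[+-]?\d+)           # integer
  | (?P<NAME>[A-Za-z_]\w*)      # keyword: Group or Entry
  | (?P<PUNCT>[{}();,])         # single-character punctuation
""", re.VERBOSE | re.DOTALL)


def tokenize(text):
    """Yield (kind, value) pairs, skipping comments and whitespace."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        kind, value = m.lastgroup, m.group()
        if kind in ("COMMENT", "WS"):
            continue
        if kind == "STRING":
            value = value[1:-1]      # strip the surrounding quotes
        elif kind == "REAL":
            value = float(value)
        elif kind == "INT":
            value = int(value)
        yield kind, value


class Parser:
    """Recursive-descent parser: one method per grammar rule."""

    def __init__(self, text):
        self.tokens = list(tokenize(text))
        self.i = 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else (None, None)

    def take(self, kind, value=None):
        tok_kind, tok_value = self.peek()
        if tok_kind != kind or (value is not None and tok_value != value):
            raise SyntaxError(f"expected {value or kind}, got {tok_value!r}")
        self.i += 1
        return tok_value

    def parse_group(self):
        self.take("NAME", "Group")
        self.take("PUNCT", "(")
        name = self.take("STRING")
        self.take("PUNCT", ")")
        self.take("PUNCT", "{")
        body = []
        while self.peek() != ("PUNCT", "}"):
            if self.peek() == ("NAME", "Group"):
                body.append(self.parse_group())   # groups may nest
            else:
                body.append(self.parse_entry())
        self.take("PUNCT", "}")
        return ["Group", name, body]

    def parse_entry(self):
        self.take("NAME", "Entry")
        self.take("PUNCT", "(")
        values = []
        while self.peek() != ("PUNCT", ")"):
            kind, value = self.peek()
            if kind not in ("STRING", "REAL", "INT"):
                raise SyntaxError(f"expected a value, got {value!r}")
            self.i += 1
            values.append(value)
            if self.peek() == ("PUNCT", ","):
                self.i += 1                       # consume the separator
        self.take("PUNCT", ")")
        self.take("PUNCT", ";")
        return ["Entry", values]


sample = '''Group("GroupName") {
    /* C-Style comment */
    Group("AnotherGroupName") {
        Entry("some","variables",0,3.141);
    }
    Entry("linebreaks",
          "allowed",
          3,
          1.414
    );
}'''

tree = Parser(sample).parse_group()
print(tree)
```

Because the tokenizer discards whitespace and comments before the parser sees anything, line breaks inside an `Entry` need no special handling. The sketch is deliberately lenient (it accepts a trailing comma, for instance), so treat it as a starting point rather than a validator.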
Is there a nice pythonic way to parse files like this?
How would you go about parsing it?
This is more a general question about how to parse strings and not so much about this particular file format.
Solution
Using pyparsing (Mark Tolonen, I was just about to click "Submit Post" when your post came through), this is pretty straightforward. See the comments embedded in the code below:
data = """Group("GroupName") {
    /* C-Style comment */
    Group("AnotherGroupName") {
        Entry("some","variables",0,3.141);
        Entry("other","variables",1,2.718);
    }
    Entry("linebreaks",
          "allowed",
          3,
          1.414
    );
} """
from pyparsing import *
# define basic punctuation and data types
LBRACE,RBRACE,LPAREN,RPAREN,SEMI = map(Suppress,"{}();")
GROUP = Keyword("Group")
ENTRY = Keyword("Entry")
# use parse actions to do parse-time conversion of values
real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda t:float(t[0]))
integer = Regex(r"[+-]?\d+").setParseAction(lambda t:int(t[0]))
# parses a string enclosed in quotes, but strips off the quotes at parse time
string = QuotedString('"')
# define structure expressions
value = string | real | integer
entry = Group(ENTRY + LPAREN + Group(Optional(delimitedList(value)))) + RPAREN + SEMI
# since Groups can contain Groups, need to use a Forward to define recursive expression
group = Forward()
group << Group(GROUP + LPAREN + string("name") + RPAREN +
               LBRACE + Group(ZeroOrMore(group | entry))("body") + RBRACE)
# ignore C style comments wherever they occur
group.ignore(cStyleComment)
# parse the sample text
result = group.parseString(data)
# print out the tokens as a nice indented list using pprint
from pprint import pprint
pprint(result.asList())
Prints
[['Group',
'GroupName',
[['Group',
'AnotherGroupName',
[['Entry', ['some', 'variables', 0, 3.141]],
['Entry', ['other', 'variables', 1, 2.718]]]],
['Entry', ['linebreaks', 'allowed', 3, 1.414]]]]]
(Unfortunately, there may be some confusion since pyparsing defines a "Group" class, for imparting structure to the parsed tokens - note how the value lists in an Entry get grouped because the list expression is enclosed within a pyparsing Group.)
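Once the tokens are in the nested-list form printed above, ordinary Python is enough to reshape them. The following is a small sketch, not part of pyparsing: `to_dict` is a hypothetical helper, and the dictionary layout it produces (`name`, `entries`, `groups`) is just one possible choice. The `parsed` literal is copied from the printed output so the example runs without pyparsing installed.

```python
# Nested list as printed by pprint(result.asList()) in the answer above.
parsed = [['Group',
           'GroupName',
           [['Group',
             'AnotherGroupName',
             [['Entry', ['some', 'variables', 0, 3.141]],
              ['Entry', ['other', 'variables', 1, 2.718]]]],
            ['Entry', ['linebreaks', 'allowed', 3, 1.414]]]]]


def to_dict(node):
    """Convert one ['Group', name, body] list into a nested dict (hypothetical helper)."""
    kind, name, body = node
    if kind != "Group":
        raise ValueError(f"expected a Group node, got {kind!r}")
    return {
        "name": name,
        # Each entry is ['Entry', [values...]]; keep only the values.
        "entries": [tuple(child[1]) for child in body if child[0] == "Entry"],
        # Nested groups are converted recursively.
        "groups": [to_dict(child) for child in body if child[0] == "Group"],
    }


# asList() wraps the top-level group in an outer list, hence parsed[0].
root = to_dict(parsed[0])
print(root["name"])
print(root["groups"][0]["entries"])
```

Separating entries from nested groups like this makes lookups simpler than indexing into positional lists, at the cost of losing the original interleaving order of groups and entries within a body.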