用python处理重复结构化的文本文件 [英] Processing repeatedly structured text file with python

查看:12
本文介绍了用python处理重复结构化的文本文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个由块组成的大文本文件,例如:

I have a big text file structured in blocks like:

Student = {
        PInfo = {
                ID   = 0001;
            Name.First = "Joe";
            Name.Last = "Burger";
            DOB  = "01/01/2000";
        };
        School = "West High";
        Address = {
            Str1 = "001 Main St.";
            Zip = 12345;
        };
    };
    Student = {
        PInfo = {
            ID   = 0002;
            Name.First = "John";
            Name.Last = "Smith";
            DOB  = "02/02/2002";
        };
        School = "East High";
        Address = {
            Str1 = "001 40nd St.";
            Zip = 12346;
        };
        Club = "Football";
    };
    ....

学生块共享相同的条目,如PInfo"、学校"和地址",但其中一些可能有额外的条目,例如约翰史密斯"的俱乐部"信息不包括在乔·伯格".我想要做的是获取每个学生的姓名、学校名称和邮政编码并将它们存储在字典中,例如

The Student blocks share the same entries like "PInfo", "School" and "Address", but some of them may have additional entries, such as the "Club" information for "John Smith" which is not included for "Joe Burger". What I want to do is to get Name, School name and zip code of each student and store them in a dictionary, like

    {'Joe Burger':{School:'West High', Zip:12345}, 'John Smith':{School:'East High', Zip:12346}, ...}

作为python编程的新手,我尝试打开文件并逐行分析,但它看起来很麻烦.实际文件比我上面发布的示例大得多,也更复杂.我想知道是否有更简单的方法来做到这一点.先谢谢了.

Being new to python programming, I tried to open the file and analyze it line by line, but it looks so cumbersome. And the real file is quite large and more complicated than the example I posted above. I am wondering if there is an easier way to do it. Thanks ahead.

推荐答案

要解析文件,您可以定义描述输入格式的语法并使用它生成解析器.

To parse the file you could define a grammar that describes your input format and use it to generate a parser.

Python 中有许多语言解析器.例如,您可以使用 Grako,它采用 EBNF 作为输入,输出 memoizing PEG Python 解析器.

There are many language parsers in Python. For example, you could use Grako that takes grammars in a variation of EBNF as input, and outputs memoizing PEG parsers in Python.

要安装 Grako,请运行 pip install grako.

To install Grako, run pip install grako.

以下是使用 Grako 的 EBNF 语法风格的格式语法:

Here's grammar for your format using Grako's flavor of EBNF syntax:

(* a file is zero or more records *)
file = { record }* $;
record = name '=' value ';' ;
name = /[A-Z][a-zA-Z0-9.]*/ ;
value = object | integer | string ;
(* an object contains one or more records *)
object = '{' { record }+ '}' ;
integer = /[0-9]+/ ;
string = '"' /[^"]*/ '"';

要生成解析器,请将语法保存到文件中,例如 Structured.ebnf 并运行:

To generate parser, save the grammar to a file e.g., Structured.ebnf and run:

$ grako -o structured_parser.py Structured.ebnf

它创建了 structured_pa​​rser 模块,可用于从输入中提取学生信息:

It creates structured_parser module that can be used to extract the student information from the input:

#!/usr/bin/env python
from structured_parser import StructuredParser

class Semantics(object):
    def record(self, ast):
        # record = name '=' value ';' ;
        # value = object | integer | string ;
        return ast[0], ast[2] # name, value
    def object(self, ast):
        # object = '{' { record }+ '}' ;
        return dict(ast[1])
    def integer(self, ast):
        # integer = /[0-9]+/ ;
        return int(ast)
    def string(self, ast):
        # string = '"' /[^"]*/ '"';
        return ast[1]

with open('input.txt') as file:
    text = file.read()
parser = StructuredParser()
ast = parser.parse(text, rule_name='file', semantics=Semantics())
students = [value for name, value in ast if name == 'Student']
d = {'{0[Name.First]} {0[Name.Last]}'.format(s['PInfo']):
     dict(School=s['School'], Zip=s['Address']['Zip'])
     for s in students}
from pprint import pprint
pprint(d)

输出

{'Joe Burger': {'School': u'West High', 'Zip': 12345},
 'John Smith': {'School': u'East High', 'Zip': 12346}}

这篇关于用python处理重复结构化的文本文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆