Tips on parsing a custom file format in Python


Question

I developed a custom system that simulates web activity, such as downloading files. I also have a custom file format that feeds into this system. I want to replace the old system, which is written in Perl, with a newer one in Python, but first I have to parse the file.

There are certain sections of the file that I would like to parse, such as [settings], which holds the arguments for the system. There is also a [macro] section, which is where the important content (the steps, etc.) begins.

What I have trouble with is parsing these sections and having my system write them out in a different, much simpler format. I have thousands of these files, so I just want to write a generator that takes an old file and writes the new format to a new file.

Old format:

[settings]
email_to=people
special_websurf_processing=1
period_0_1_only=1
crc_recheck=0

[macro]
%::WebSurfRules =
    (
    'Step1' =>
        {
        action                  => 'NAVIGATE',
        inputstring             => 'http://www.tda-sgft.com/TdaWeb/jsp/fondos/Fondos.tda',
        },
    'Step2' =>
        {
        action        => 'CLICK_REFERENCE',
        matchtype     => 'OUTER',
        matchstring   => 'phHttpDest->\{\'FirstClick\'\}',
        pass          => 'phHttpDest->\{\'Step2Pass\'\}',
        },
    'Step3' =>
        {
        action        => 'CLICK_REFERENCE',
        matchtype     => 'OUTER',
        matchstring   => 'phHttpDest->\{\'SecondClick\'\}',
        },
    'Step4' =>
        {
        action        => 'CLICK_REFERENCE',
        matchtype     => 'OUTER',
        matchstring   => 'phHttpDest->\{\'DealClick\'\}',
        accept_multi_match  => 'ANY_TOP_FIRST',
        },
    'Step5' =>
        {
        action        => 'CLICK_REFERENCE',
        matchtype     => 'INNER',
        matchstring   => 'phHttpDest->\{\'LinkClick2\'\}',
        fail          => 'Step6',
    #    accept_multi_match  => 'ANY_TOP_LAST',
        },
    'Step6' =>
        {
        action        => 'CLICK_REFERENCE',
        matchtype     => 'INNER',
        matchstring   => 'phHttpDest->\{\'DocClick\'\}',
        },
    'Step7' =>
        {
        action                  => 'CLICK_DOWNLOAD_OK',
        },
    );

[data]
Print WebAddress______________  Destination_________________________________________________ FirstClick_________________ SecondClick________________    DealClick_________________________   LinkClick2________________________  DocClick___________________________________ PayInterval   DueDay  Step2Pass__________     QaRule_________________________________________________________________________________________________________________
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_apl.pdf              Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Fund´s Allocation                           q1                    Step3                   qa_regexp=Report D?d?ate\\s+\\d\\d\/$MM{$n}\/$YYYY{$n}
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf             Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Investors information on Payment Date       q1                    Step3                   qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMES[$MM{$n}-1].+$YYYY{$n}
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf             Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Investors information on Payment Date       q1                    Step3                   qa_regexp=PAYMENT DATE:\\s+$aCAPSHORTMONTHNAMES[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf             Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Investors information on Payment Date       q1                    Step3                   qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMESSPANISH[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}

What I want to output:

[settings]
email_to=people
special_websurf_processing=1
period_0_1_only=1
crc_recheck=0
[macro]
%::WebSurfRules =
    (
    '1'     => 'NAVIGATE,phHttpDest->\{\'WebAddress\'\}', 
    '2'     => 'CLICK_REFERENCE,phHttpDest->\{\'FirstClick\'\}',                                                         
    '3'     => 'CLICK_REFERENCE,phHttpDest->\{\'SecondClick\'\}',                                 
    '4'     => 'CLICK_REFERENCE,phHttpDest->\{\'DealClick\'\}',
    '5'     => 'CLICK_REFERENCE,phHttpDest->\{\'LinkClick2\'\}',                     
    '6'     => 'CLICK_REFERENCE,phHttpDest->\{\'DocClick\'\}',           
    );

[data]
Print WebAddress______________  Destination_________________________________________________ FirstClick_________________ SecondClick________________    DealClick_________________________   LinkClick2________________________  DocClick___________________________________ PayInterval   DueDay  Step2Pass__________     QaRule_________________________________________________________________________________________________________________
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_apl.pdf              Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Fund´s Allocation                           q1                    Step3                   qa_regexp=Report D?d?ate\\s+\\d\\d\/$MM{$n}\/$YYYY{$n}
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf             Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Investors information on Payment Date       q1                    Step3                   qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMES[$MM{$n}-1].+$YYYY{$n}
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf             Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Investors information on Payment Date       q1                    Step3                   qa_regexp=PAYMENT DATE:\\s+$aCAPSHORTMONTHNAMES[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf             Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Investors information on Payment Date       q1                    Step3                   qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMESSPANISH[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}

Here, the phHttpDest reference and the action in each click step correlate to the headings of the [data] section.
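
To make that correspondence concrete, here is a minimal sketch (not part of the original post) that splits a file into its [name] sections, reads the column names from the first [data] line, and lists the phHttpDest keys used in [macro]. The function names split_sections, data_headings and referenced_keys and the shortened sample are assumptions made for illustration. It also assumes that each section header sits alone on its line and that the padding after a heading consists only of underscores.

```python
import re

def split_sections(text):
    """Return {section_name: body_text}, one entry per [name] header line."""
    sections = {}
    current = None
    for line in text.splitlines():
        header = re.fullmatch(r"\[(\w+)\]\s*", line)
        if header:
            current = header.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body) for name, body in sections.items()}

def data_headings(data_body):
    """Column names from the first non-blank [data] line, underscores stripped."""
    header = next(line for line in data_body.splitlines() if line.strip())
    return [name.rstrip("_") for name in header.split()]

def referenced_keys(macro_body):
    """Keys that appear inside escaped phHttpDest references in [macro]."""
    return re.findall(r"phHttpDest->\\\{\\'(\w+)\\'\\\}", macro_body)

# Shortened, hypothetical sample in the old format (raw strings keep the backslashes).
sample = "\n".join([
    "[settings]",
    "email_to=people",
    "",
    "[macro]",
    r"        matchstring   => 'phHttpDest->\{\'FirstClick\'\}',",
    r"        pass          => 'phHttpDest->\{\'Step2Pass\'\}',",
    "",
    "[data]",
    "Print WebAddress______  FirstClick_____ Step2Pass____",
    "0     http://www.tda-sgft.com/  Mortgage Loan  Step3",
])

sections = split_sections(sample)
headings = data_headings(sections["data"])
# Any key used in [macro] that has no matching [data] column would show up here.
missing = [key for key in referenced_keys(sections["macro"]) if key not in headings]
print(headings)
print(missing)
```

Running a check like this over all of the old files before converting them is one way to find files whose macro refers to a column that the [data] header does not define.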

Recommended answer

One way to do it is with a set of regular-expression replacements that rewrite the file into the new format. I didn't completely understand the rules of your format, so I implemented the overall approach, but the result differs from your target in a few places. You'll have to go in and make some adjustments to fine-tune it. The output.txt file is what gets produced when your example is used as input.txt.

Code

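import re

# Read the whole old-format file into one string.
with open('input.txt') as f:
    data = f.read()

# Turn each "'StepN' => { action => " header into "'N'     => ".
data = re.sub(r"    'Step([0-9]+)' =>\s+{\s+action\s+=> ", r"    '\1'     => ", data)
# Drop the fields the new format does not keep, and commented-out lines.
data = re.sub(r"',\s+pass\s+[^,]+,", "", data)
data = re.sub(r"',\s+accept_multi_match\s+[^,]+,", "", data)
data = re.sub(r"\n +#.*\n", "\n", data)
data = re.sub(r"',\s+fail\s+[^,]+,", "", data)
data = re.sub(r"',\s+matchtype\s+[^,]+,", "", data)
# Join the action with its inputstring or matchstring value using a comma.
data = re.sub(r"',\s+inputstring\s+=> '", ",", data)
data = re.sub(r"\s+matchstring\s+=> '", ",", data)
# Close each step where its "}," line used to be.
data = re.sub(r"\n        },", "',", data)

with open('output.txt', 'w') as f:
    f.write(data)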

output.txt

[settings]
email_to=people
special_websurf_processing=1
period_0_1_only=1
crc_recheck=0

[macro]
%::WebSurfRules =
    (
    '1'     => 'NAVIGATE,http://www.tda-sgft.com/TdaWeb/jsp/fondos/Fondos.tda',',
    '2'     => 'CLICK_REFERENCE,phHttpDest->\{\'FirstClick\'\}',
    '3'     => 'CLICK_REFERENCE,phHttpDest->\{\'SecondClick\'\}',',
    '4'     => 'CLICK_REFERENCE,phHttpDest->\{\'DealClick\'\}',
    '5'     => 'CLICK_REFERENCE,phHttpDest->\{\'LinkClick2\'\}',
    '6'     => 'CLICK_REFERENCE,phHttpDest->\{\'DocClick\'\}',',
    '7'     => 'CLICK_DOWNLOAD_OK',',
    );

...
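
Some lines in the output above end in a stray "','". This happens in steps where no trailing field such as pass or fail was removed, so the closing quote of the last value survives and the final substitution adds a second one. One possible adjustment, sketched here under assumptions, is to parse each step into a dictionary and then emit exactly one line per step. The names STEP_RE, FIELD_RE and convert_macro are hypothetical. The sketch assumes every step has an action, that each field sits on its own line as key => 'value', and that only action plus matchstring (or inputstring) should be kept. It does not try to reproduce the asker's target exactly, which rewrites Step1 to refer to WebAddress and omits Step7.

```python
import re

# One whole step: "'StepN' =>", an opening "{", the field lines, a closing "},".
STEP_RE = re.compile(
    r"^[ \t]*'Step(\d+)'\s*=>\s*\{[ \t]*\n(.*?)^[ \t]*\},[ \t]*\n",
    re.S | re.M,
)
# One field line: key => 'value',  (lines starting with "#" do not match).
FIELD_RE = re.compile(r"^[ \t]*(\w+)\s*=>\s*'(.*)',[ \t]*$", re.M)

def convert_macro(text):
    """Rewrite every step block in text as a single "'N' => 'ACTION,target'," line."""
    def emit(match):
        number, body = match.group(1), match.group(2)
        fields = dict(FIELD_RE.findall(body))
        parts = [fields["action"]]
        target = fields.get("matchstring") or fields.get("inputstring")
        if target:
            parts.append(target)
        return "    '%s'     => '%s',\n" % (number, ",".join(parts))
    # A function replacement is inserted literally, so backslashes are preserved.
    return STEP_RE.sub(emit, text)

# Shortened sample taken from the old format shown in the question.
sample = "\n".join([
    "%::WebSurfRules =",
    "    (",
    "    'Step5' =>",
    "        {",
    "        action        => 'CLICK_REFERENCE',",
    "        matchtype     => 'INNER',",
    r"        matchstring   => 'phHttpDest->\{\'LinkClick2\'\}',",
    "        fail          => 'Step6',",
    "    #    accept_multi_match  => 'ANY_TOP_LAST',",
    "        },",
    "    'Step7' =>",
    "        {",
    "        action                  => 'CLICK_DOWNLOAD_OK',",
    "        },",
    "    );",
]) + "\n"

print(convert_macro(sample))
```

Because each step is parsed before anything is written, the order in which fields appear inside a step no longer matters, and further rules (for example skipping certain actions) can be added inside emit.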
