Legacy data parsing


Question





Hi,

I've just started to learn programming and was told this was a good
place to ask questions :)

Where I work, we receive large quantities of data which is currently
all printed on large, obsolete, dot matrix printers. This is a problem
because the replacement parts will not be available for much longer.

So I'm trying to create a program which will capture the fixed width
text file data and convert as well as sort the data (there are several
different report types) into a different format which would allow it to
be printed normally, or viewed on a computer.

I've been reading up on the Regular Expression module and ways in which
to manipulate strings; however, it has been difficult to think of a way
in which to extract an address.

Here's an example of the raw text that I have to work with:
ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
****************************

FOR/POUR AL/LA: 20
CORR TYP: A1B 2C3 P:3 CHNGD/CHANG
LANG: E CONS/REGR: #######
MRS XXX X XXXXXXX
### XXXXXXXXX ST DD TYP: P:6
CHNGD/CHANG
MONCTON NB LANG: E CONS/REGR:
#######
MRS XXX X XXXXXXX
#####
####
###-###-#

ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
****************************

FOR/POUR AL/LA: 30
BOTH TYP: A1B 2D3 P:3 CHNGD/CHANG
LANG: E CONS/REGR: #######
MISS XXXX XXXXX
### XXXXXXXX ST
MONCTON NB

EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
***********

(the # = any number, and the X's are just regular text)
I would like to extract the address information, but the two different
text objects on the right hand side are difficult to remove. I think
it would be easier if I could just extract a fixed square of
information, but I don't have a clue as to how to go about it.

If anyone could give me suggestions as to methods in sorting this type
of data, it would be appreciated.

Recommended answer

gov wrote:



Hi,
<snip>
If anyone could give me suggestions as to methods in sorting this type
of data, it would be appreciated.





Maybe it's overkill, but I'd *highly* recommend David Mertz's excellent
book "Text Processing in Python": http://gnosis.cx/TPiP/ I don't know
everything you need to do, but that small snippet smells like it needs
a state machine, and this book has an excellent, simple one in (I
think) chapter 4.

Jeremy Jones
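
To make the state-machine suggestion concrete, here is a minimal sketch. It is not the one from the book. It assumes that every section title is followed directly by a line made only of asterisks, as in the sample, and that address sections are the ones whose title starts with "ADDRESS INFORMATION". The names extract_address_blocks, is_rule, SKIP and IN_ADDRESS are made up for illustration.

```python
SKIP, IN_ADDRESS = "skip", "in_address"  # the two states of the machine

SAMPLE = [
    "ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:",
    "****************************",
    "",
    "FOR/POUR AL/LA: 20",
    "MRS XXX X XXXXXXX",
    "MONCTON NB",
    "",
    "ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:",
    "****************************",
    "",
    "FOR/POUR AL/LA: 30",
    "MISS XXXX XXXXX",
    "",
    "EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:",
    "***********",
]


def is_rule(line):
    """True for a non-empty line consisting only of asterisks."""
    stripped = line.strip()
    return bool(stripped) and set(stripped) == {"*"}


def extract_address_blocks(lines, title_prefix="ADDRESS INFORMATION"):
    """Return one list of non-blank lines per address section."""
    state = SKIP
    blocks, current = [], []
    previous = ""
    for raw in lines:
        line = raw.rstrip()
        if is_rule(line):
            # A rule line means the previous line was a section title.
            if state == IN_ADDRESS:
                if previous.strip() and current:
                    current.pop()  # that title was collected by mistake
                blocks.append(current)
                current = []
            is_address = previous.strip().startswith(title_prefix)
            state = IN_ADDRESS if is_address else SKIP
        elif state == IN_ADDRESS and line.strip():
            current.append(line)
        previous = line
    if state == IN_ADDRESS:
        blocks.append(current)  # the file ended inside an address section
    return blocks


blocks = extract_address_blocks(SAMPLE)
```

With a real file you would pass the open file object instead of SAMPLE. Other report types would need their own title prefixes and probably more states.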


Hello gov,
Here's an example of the raw text that I have to work with:

ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
****************************

FOR/POUR AL/LA: 20
CORR TYP: A1B 2C3 P:3 CHNGD/CHANG
LANG: E CONS/REGR: #######
MRS XXX X XXXXXXX
### XXXXXXXXX ST DD TYP: P:6
CHNGD/CHANG
MONCTON NB LANG: E CONS/REGR:
#######
MRS XXX X XXXXXXX
#####
####
###-###-#

ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
****************************

FOR/POUR AL/LA: 30
BOTH TYP: A1B 2D3 P:3 CHNGD/CHANG
LANG: E CONS/REGR: #######
MISS XXXX XXXXX
### XXXXXXXX ST
MONCTON NB

EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
***********

(the # = any number, and the X's are just regular text)
I would like to extract the address information, but the two different
text objects on the right hand side are difficult to remove. I think
it would be easier if I could just extract a fixed square of
information, but I don't have a clue as to how to go about it.

If anyone could give me suggestions as to methods in sorting this type
of data, it would be appreciated.





Maybe regular expressions are too difficult for this. I'd try one of the
parsing toolkits (such as PLY, PyParsing ...); one of them might be more
suitable for the job.

HTH.
--
------------------------------------------------------------------------
Miki Tebeka <mi*********@zoran.com>
http://tebeka.bizhat.com
The only difference between children and adults is the price of the toys



gov wrote:


Hi,

I've just started to learn programming and was told this was a good
place to ask questions :)

Where I work, we receive large quantities of data which is currently
all printed on large, obsolete, dot matrix printers. This is a problem
because the replacement parts will not be available for much longer.

So I'm trying to create a program which will capture the fixed width
text file data and convert as well as sort the data (there are several
different report types) into a different format which would allow it to
be printed normally, or viewed on a computer.







Are these reports all of the same page-wise format, with fixed-width
columns? If so, then the suggestion about a state machine sounds good
-- just run a state machine to figure out which line type you're on, then
extract the fixed-width fields via slices.

name = line[x:y]

If that doesn''t work, then pyparsing or DParser might work for you as a
more general-purpose parser.
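
A minimal sketch of the slicing idea, which is also one way to pull out the "fixed square" the original post asks about. The column offsets here (address text in columns 0-39, the TYP/LANG/CONS fields in columns 40-79) are pure assumptions, because the original spacing was lost in the posted sample; measure the real offsets from the actual report files. The names LEFT, RIGHT, split_columns and address_lines are made up for illustration.

```python
# Hypothetical layout: replace these offsets with the real column positions.
LEFT = slice(0, 40)    # name / street / city
RIGHT = slice(40, 80)  # TYP, LANG, CONS/REGR, CHNGD/CHANG fields
LINE_WIDTH = 80


def split_columns(line):
    """Cut one fixed-width line into its (left, right) column text."""
    padded = line.rstrip("\n").ljust(LINE_WIDTH)  # short lines become full width
    return padded[LEFT].strip(), padded[RIGHT].strip()


def address_lines(block):
    """Keep only the non-empty left-hand column of each line in a block."""
    return [left for left, _right in map(split_columns, block) if left]


# Example block built to match the assumed layout above.
block = [
    "MRS XXX X XXXXXXX".ljust(40) + "LANG: E   CONS/REGR: 1234567",
    "123 XXXXXXXXX ST".ljust(40) + "TYP: P:6",
    " " * 40 + "CHNGD/CHANG",
    "MONCTON NB",
]
address = address_lines(block)
```

Padding each line with ljust before slicing means a line that ends early (like "MONCTON NB") still yields an empty right-hand column.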

