Working with Huge Text Files


Problem Description


Hi there, I'm a Python newbie hoping for some direction in working with
text files that range from 100MB to 1G in size. Basically certain rows,
sorted by the first (primary) field maybe second (date), need to be
copied and written to their own file, and some string manipulations
need to happen as well. An example of the current format:

XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
|
| followed by like a million rows similar to the above, with
| incrementing date and time, and then on to next primary field
|
ABC,04JAN1993,9:30:27,28.875,7600,40,0,Z,N
|
| etc., there are usually 10-20 of the first field per file
| so there's a lot of repetition going on
|

The export would ideally look like this where the first field would be
written as the name of the file (XYZ.txt):

19930104, 93027, 2887, 7600, 40, 0, Z, N

Pretty ambitious for a newbie? I really hope not. I've been looking at
simpleParse, but it's a bit intense at first glance... not sure where
to start, or even if I need to go that route. Any help from you guys in
what direction to go or how to approach this would be hugely
appreciated.

Best regards,
Lorn

Solution


Lorn Davies wrote:

Hi there, I'm a Python newbie hoping for some direction in working with text files that range from 100MB to 1G in size. Basically certain rows, sorted by the first (primary) field maybe second (date), need to be
copied and written to their own file, and some string manipulations
need to happen as well. An example of the current format:

XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
|
| followed by like a million rows similar to the above, with
| incrementing date and time, and then on to next primary field
|
ABC,04JAN1993,9:30:27,28.875,7600,40,0,Z,N
|
| etc., there are usually 10-20 of the first field per file
| so there's a lot of repetition going on
|

The export would ideally look like this where the first field would be written as the name of the file (XYZ.txt):

19930104, 93027, 2887, 7600, 40, 0, Z, N

Pretty ambitious for a newbie? I really hope not. I've been looking at simpleParse, but it's a bit intense at first glance... not sure where
to start, or even if I need to go that route. Any help from you guys in what direction to go or how to approach this would be hugely
appreciated.

Best regards,
Lorn



You could use the csv module.

Here's the example from the manual with your sample data in a file
named simple.csv:

import csv
reader = csv.reader(file("some.csv"))
for row in reader:
    print row

"""
['XYZ', '04JAN1993', '9:30:27', '28.87', '7600', '40', '0', 'Z', 'N']
['XYZ', '04JAN1993', '9:30:28', '28.87', '1600', '40', '0', 'Z', 'N']
['ABC', '04JAN1993', '9:30:27', '28.875', '7600', '40', '0', 'Z', 'N']
"""

The csv module will bring each line in as a list of strings.
Of course, you want to process each line before printing it.
And you don't just want to print it, you want to write it to a file.

So after reading the first line, open a file for writing with the
first field (row[0]) as the file name. Then you want to process
fields row[1], row[2] and row[3] to get them in the right format
and then write all the row fields except row[0] to the file that's
open for writing.

On every subsequent line you must check to see if row[0] has changed,
so you'll have to store row[0] in a variable. If it's changed, close
the file you've been writing to and open a new file with the new
row[0]. Then continue processing lines as before.

It will only be this simple if you can guarantee that the original
file is actually sorted by the first field.
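The steps described above can be sketched as one small script. This is a minimal sketch in modern Python 3 (the original thread predates Python 3), and the field formatting — date 04JAN1993 to 19930104, time 9:30:27 to 93027, price 28.87 to 2887 — is inferred from the sample rows, so treat the helpers as assumptions rather than the poster's exact spec:

```python
import csv
from datetime import datetime

def split_by_first_field(in_path):
    """Split a CSV file (sorted by its first field) into one output
    file per first-field value, reformatting date/time/price fields."""
    current_key = None
    out = None
    with open(in_path, newline="") as src:
        for row in csv.reader(src):
            if not row:
                continue
            if row[0] != current_key:              # first field changed
                if out is not None:
                    out.close()
                current_key = row[0]
                out = open(current_key + ".txt", "w")  # e.g. XYZ.txt
            # 04JAN1993 -> 19930104
            date = datetime.strptime(row[1], "%d%b%Y").strftime("%Y%m%d")
            time = row[2].replace(":", "")          # 9:30:27 -> 93027
            price = row[3].replace(".", "")         # 28.87 -> 2887
            out.write(",".join([date, time, price] + row[4:]) + "\n")
    if out is not None:
        out.close()
```

As the reply stresses, this one-pass version only works if the input really is sorted by the first field.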


me********@aol.com wrote:

It will only be this simple if you can guarantee that the original
file is actually sorted by the first field.



And if not, you can either sort the file ahead of time, or just keep
reopening the files in append mode when necessary. You could sort them
in memory in your Python program but given the size of these files I
think one of the other alternatives would be simpler.
--
Michael Hoffman
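The reopen-in-append-mode idea for unsorted input could be sketched like this (the per-field reformatting from the thread is omitted for brevity, and the `.txt` naming follows the earlier example; reopening per row is simple but slow):

```python
import csv

def split_unsorted(in_path):
    """Split an *unsorted* CSV file into one file per first-field value
    by reopening each output file in append mode as needed."""
    seen = set()
    with open(in_path, newline="") as src:
        for row in csv.reader(src):
            if not row:
                continue
            key = row[0]
            # Truncate the file on first sight of a key, append afterwards.
            mode = "a" if key in seen else "w"
            seen.add(key)
            with open(key + ".txt", mode) as out:
                out.write(",".join(row[1:]) + "\n")
```

A faster variant would keep a dict of open file objects keyed by `row[0]`, at the cost of one open handle per distinct first field.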



me********@aol.com wrote:

Lorn Davies wrote:

Hi there, I'm a Python newbie hoping for some direction in working with
text files that range from 100MB to 1G in size. Basically certain rows,
sorted by the first (primary) field maybe second (date), need to be
copied and written to their own file, and some string manipulations
need to happen as well. An example of the current format:

XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
|
| followed by like a million rows similar to the above, with
| incrementing date and time, and then on to next primary field
|
ABC,04JAN1993,9:30:27,28.875,7600,40,0,Z,N
|
| etc., there are usually 10-20 of the first field per file
| so there's a lot of repetition going on
|

The export would ideally look like this where the first field would be
written as the name of the file (XYZ.txt):

19930104, 93027, 2887, 7600, 40, 0, Z, N

Pretty ambitious for a newbie? I really hope not. I've been looking at
simpleParse, but it's a bit intense at first glance... not sure where
to start, or even if I need to go that route. Any help from you guys in
what direction to go or how to approach this would be hugely
appreciated.

Best regards,
Lorn
You could use the csv module.

Here's the example from the manual with your sample data in a file
named simple.csv:



Obviously, I meant "some.csv". Make sure the name in the program
matches the file you want to process, or pass the input file name
to the program as an argument.

import csv
reader = csv.reader(file("some.csv"))
for row in reader:
    print row

"""
['XYZ', '04JAN1993', '9:30:27', '28.87', '7600', '40', '0', 'Z', 'N']
['XYZ', '04JAN1993', '9:30:28', '28.87', '1600', '40', '0', 'Z', 'N']
['ABC', '04JAN1993', '9:30:27', '28.875', '7600', '40', '0', 'Z', 'N']
"""

The csv module will bring each line in as a list of strings.
Of course, you want to process each line before printing it.
And you don't just want to print it, you want to write it to a file.

So after reading the first line, open a file for writing with the
first field (row[0]) as the file name. Then you want to process
fields row[1], row[2] and row[3] to get them in the right format
and then write all the row fields except row[0] to the file that's
open for writing.

On every subsequent line you must check to see if row[0] has changed,
so you'll have to store row[0] in a variable. If it's changed, close
the file you've been writing to and open a new file with the new
row[0]. Then continue processing lines as before.

It will only be this simple if you can guarantee that the original
file is actually sorted by the first field.



