Split large file by string/regex


Problem description




I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split that.
I could probably use a lexer, but maybe there is something simpler?
Thanks,
m.

Solution

Martin Dieringer wrote:

> I am trying to split a file by a fixed string.
> The file is too large to just read it into a string and split that.
> I could probably use a lexer, but maybe there is something simpler?
> Thanks,
> m.



Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all
matches. For me that would be more complex than using a lexer, given the
excellent range of modules such as SPARK and PLY, to mention but two.

regards
Steve
--
http://www.holdenweb.com
http://pydish.holdenweb.com
Holden Web LLC +1 800 494 3119
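
For a fixed separator string, the overlapping-chunk idea described above can be sketched fairly compactly. The following is a minimal illustration, not code from the thread: the name `split_stream` and the `chunk_size` parameter are hypothetical, and it assumes a text-mode file object and a non-empty literal separator. A regex whose matches can vary in length would need more care, since a match at the end of a chunk might be extended by data that has not been read yet.

```python
import io


def split_stream(fh, sep, chunk_size=64 * 1024):
    """Yield the pieces of a text stream that are separated by `sep`."""
    if not sep:
        raise ValueError("empty separator")
    buf = ""
    while True:
        chunk = fh.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        parts = buf.split(sep)
        # The last part may end in the middle of a separator, so hold it
        # back and prepend it to the next chunk instead of yielding it.
        buf = parts.pop()
        for part in parts:
            yield part
    yield buf  # whatever follows the final separator


# Tiny chunks force the separator to straddle chunk boundaries.
demo = io.StringIO("alpha--SEP--beta--SEP--gamma")
pieces = list(split_stream(demo, "--SEP--", chunk_size=4))
print(pieces)
```

Memory use is bounded by the chunk size plus the longest single piece, so this only helps if the pieces themselves fit in memory. The held-back tail is rescanned on every read, which is simple but not the fastest possible approach.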


On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:

> I am trying to split a file by a fixed string.
> The file is too large to just read it into a string and split that.
> I could probably use a lexer, but maybe there is something simpler?



If the pattern is contained within a single line, do something like this:

    import re

    # f is the input path; f1 and f2 are the two output paths.
    # (The original post opened f1 twice; f2 is assumed here.)
    myre = re.compile(r'foo')
    fh = open(f)
    fh1 = open(f1, 'w')
    s = fh.readline()
    # Copy lines to the first file until one matches or the input ends.
    while s and not myre.search(s):
        fh1.write(s)
        s = fh.readline()
    fh1.close()
    fh2 = open(f2, 'w')
    # The matching line and everything after it go to the second file.
    while s:
        fh2.write(s)
        s = fh.readline()
    fh2.close()
    fh.close()

I'm doing this off the top of my head, so this code may well have
bugs. Hopefully it's enough to get you started... Note that only
one line is held in memory at any point in time. Oh, and since there's a
chance that the pattern does not appear in the file, the first while
loop also checks for EOF.

Jason
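
The line-by-line approach above splits the file once, at the first matching line. If the goal is to split at every matching line, the same idea might be extended as in the sketch below. The function name `split_file_on_lines`, the `out_prefix` parameter, and the numbered output file names are hypothetical choices made for illustration; as in the post above, it assumes the pattern never spans a line break.

```python
import os
import re
import tempfile


def split_file_on_lines(src_path, pattern, out_prefix):
    """Start a new output file at each line matching `pattern`.

    Returns the list of paths written. Only one line is held in memory.
    """
    regex = re.compile(pattern)
    paths = []
    out = None
    with open(src_path) as src:
        for line in src:
            # Open the first part, or a fresh part when the line matches.
            if out is None or regex.search(line):
                if out is not None:
                    out.close()
                paths.append("%s%03d" % (out_prefix, len(paths)))
                out = open(paths[-1], "w")
            out.write(line)
    if out is not None:
        out.close()
    return paths


# Small demonstration in a temporary directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "input.txt")
with open(src, "w") as fh:
    fh.write("one\ntwo\nfoo three\nfour\nfoo five\n")
parts = split_file_on_lines(src, r"foo", os.path.join(workdir, "part"))
print(len(parts))
```

Each matching line becomes the first line of a new part, which mirrors the behaviour of the two-file version where the matching line goes to the second file.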


> Depends on your definition of "simple", I suppose. The problem with
> *not* using a lexer is that you'd have to examine the file in a sequence
> of overlapping chunks to make sure that a regex could pick up all
> matches. For me that would be more complex than using a lexer, given the
> excellent range of modules such as SPARK and PLY, to mention but two.



At least SPARK operates on whole strings if used as a lexer/tokenizer. You
can of course feed it a lazy sequence of tokens by using a generator, but
that's up to you.

--
Regards,

Diez B. Roggisch
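
To illustrate what a lazy sequence of tokens from a generator could look like, here is a small sketch. It does not use SPARK or PLY; the function name `tokens` and the token kinds "TEXT" and "SEP" are made up for this example. It assumes the separator pattern matches non-empty text and never spans a line break, so the file can be read one line at a time.

```python
import io
import re


def tokens(fh, pattern):
    """Lazily yield ("TEXT", s) and ("SEP", s) tokens, one line at a time."""
    regex = re.compile(pattern)
    for line in fh:
        pos = 0
        for m in regex.finditer(line):
            if m.start() > pos:
                # Plain text between the previous match and this one.
                yield ("TEXT", line[pos:m.start()])
            yield ("SEP", m.group())
            pos = m.end()
        if pos < len(line):
            # Remainder of the line after the last match (or the whole line).
            yield ("TEXT", line[pos:])


stream = io.StringIO("a foo b\nc\nfoo d\n")
result = list(tokens(stream, r"foo"))
print(result)
```

Because `tokens` is a generator, a consumer that iterates over it pulls one token at a time and the whole file is never held in memory; `list()` is used here only to show the result.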

