Split one file into multiple files based on pattern (cut can occur within lines)
Problem description
Many solutions already exist, but the specific requirement here is that I need to be able to split within a line: the cut should occur just before the pattern. For example:
INFILE:
<?xml 1><blabla1>
<blabla><blabla2><blabla>
<blabla><blabla>
<blabla><blabla3><blabla><blabla>
<blabla><blabla><blabla><?xml 4>
<blabla>
<blabla><blabla><blabla>
<blabla><?xml 2><blabla><blabla>
With the pattern <?xml, this should become:
Outfile1:
<?xml 1><blabla1>
<blabla><blabla2><blabla>
<blabla><blabla>
<blabla><blabla3><blabla><blabla>
<blabla><blabla><blabla>
Outfile2:
<?xml 4>
<blabla>
<blabla><blabla><blabla>
<blabla>
Outfile3:
<?xml 2><blabla><blabla>
The Perl script in the validated answer here (http://stackoverflow.com/questions/8061475/split-one-file-into-multiple-files-based-on-pattern) works fine for my small example, but it fails on my much bigger actual files (about 6 GB). The error is:
panic: sv_setpvn called with negative strlen at /home/.../split.pl line 7, <> chunk 1.
I don't have permission to comment, which is why I started a new post.
Finally, a Python solution would be even more appreciated, as I understand it better.
Recommended answer
This performs the split without reading everything into RAM:
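def files():
    # Yield a new numbered output file each time one is requested.
    n = 0
    while True:
        n += 1
        yield open('/output/dir/%d.part' % n, 'w')

filename = 'INFILE'  # placeholder: path to the input file
pat = '<?xml'
fs = files()
outfile = next(fs)

# Note: if the input starts with the pattern, 1.part ends up empty.
with open(filename) as infile:
    for line in infile:
        if pat not in line:
            outfile.write(line)
        else:
            # Cut just before each occurrence of the pattern in this line.
            items = line.split(pat)
            outfile.write(items[0])
            for item in items[1:]:
                outfile.close()
                outfile = next(fs)
                outfile.write(pat + item)
outfile.close()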
A word of warning: this does not work if the pattern spans multiple lines (i.e. contains \n). Consider an mmap-based solution if that is the case.
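If the pattern can span a line break, one option is to memory-map the input and search it as a single byte string instead of iterating line by line. The following is only a minimal sketch of that idea, not a tested solution for the 6 GB case. The function name split_on_pattern, the out_dir argument and the block_size parameter are hypothetical names introduced for illustration. It assumes a 64-bit Python (so that a multi-gigabyte file can be mapped) and a non-empty input file.

```python
import mmap
import os


def split_on_pattern(in_path, out_dir, pat=b'<?xml', block_size=1 << 20):
    """Cut in_path just before each occurrence of pat (hypothetical helper).

    Writes numbered files 1.part, 2.part, ... into out_dir and returns
    the list of paths written.
    """
    paths = []
    if os.path.getsize(in_path) == 0:
        return paths  # an empty file cannot be memory-mapped
    with open(in_path, 'rb') as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        size = len(mm)
        start = 0
        while start < size:
            # Next occurrence strictly after the start of the current chunk.
            nxt = mm.find(pat, start + 1)
            end = size if nxt == -1 else nxt
            path = os.path.join(out_dir, '%d.part' % (len(paths) + 1))
            with open(path, 'wb') as out:
                # Copy in bounded blocks so a huge chunk is never
                # held in memory all at once.
                for pos in range(start, end, block_size):
                    out.write(mm[pos:min(pos + block_size, end)])
            paths.append(path)
            start = end
    return paths
```

Because the search runs over raw bytes, the pattern is given as a bytes object, and it may itself contain b'\n'. Unlike the line-based version, this sketch does not create an empty first file when the input begins with the pattern.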