Split one file into multiple files based on pattern (cut can occur within lines)

Problem description

Many solutions exist, but the particular requirement here is that I need to be able to split within a line: the cut should occur just before the pattern. For example:

INFILE:

<?xml 1><blabla1>
<blabla><blabla2><blabla>
<blabla><blabla>
<blabla><blabla3><blabla><blabla>
<blabla><blabla><blabla><?xml 4>
<blabla>
<blabla><blabla><blabla>
<blabla><?xml 2><blabla><blabla>

With the pattern <?xml, this should become:

Outfile1:

<?xml 1><blabla1>
<blabla><blabla2><blabla>
<blabla><blabla>
<blabla><blabla3><blabla><blabla>
<blabla><blabla><blabla>

Outfile2:

<?xml 4>
<blabla>
<blabla><blabla><blabla>
<blabla>

Outfile3:

<?xml 2><blabla><blabla>

Actually, the Perl script in the accepted answer here (http://stackoverflow.com/questions/8061475/split-one-file-into-multiple-files-based-on-pattern) works fine for my small example, but it fails on my actual, much bigger (about 6 GB) files. The error is:

panic: sv_setpvn called with negative strlen at /home/.../split.pl line 7, <> chunk 1.

I don't have permission to comment, which is why I started a new post. Finally, a Python solution would be even more appreciated, as I understand it better.

Recommended answer

This performs the split without reading everything into RAM:

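filename = 'infile.txt'  # placeholder: path to the input file
pat = '<?xml'


def files():
    """Yield a newly opened output file on each call: 1.part, 2.part, ..."""
    n = 0
    while True:
        n += 1
        yield open('/output/dir/%d.part' % n, 'w')


fs = files()
outfile = next(fs)

with open(filename) as infile:
    # Iterate line by line, so only one line is held in memory at a time.
    for line in infile:
        if pat not in line:
            outfile.write(line)
        else:
            # Text before the first match still belongs to the current part.
            # Note: if the input starts with pat, 1.part is left empty.
            items = line.split(pat)
            outfile.write(items[0])
            for item in items[1:]:
                outfile.close()  # finish the previous part
                outfile = next(fs)
                outfile.write(pat + item)

outfile.close()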
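
One detail of the loop above: when the input begins with the pattern, as in the question's example, the first part file is left empty and the numbering is shifted by one. The following is a minimal sketch of a variant that opens each part only when there is something to write to it. The function name split_on_pattern and its parameters are hypothetical, introduced only for this illustration; it still reads line by line and assumes the pattern does not contain a newline.

```python
import os


def split_on_pattern(in_path, out_dir, pat='<?xml'):
    """Split in_path just before each occurrence of pat; return part paths."""
    paths = []
    out = None

    def start_part():
        # Close the current part (if any) and open the next numbered one.
        nonlocal out
        if out is not None:
            out.close()
        path = os.path.join(out_dir, '%d.part' % (len(paths) + 1))
        paths.append(path)
        out = open(path, 'w')

    with open(in_path) as infile:
        for line in infile:
            # head is the text before the first match (the whole line if
            # there is no match); rest holds the text after each match.
            head, *rest = line.split(pat)
            if head:
                if out is None:
                    start_part()  # text precedes the first pattern
                out.write(head)
            for item in rest:
                start_part()
                out.write(pat + item)
    if out is not None:
        out.close()
    return paths
```

Called as split_on_pattern('infile.txt', '/output/dir') on the question's INFILE, this should produce three parts matching Outfile1 to Outfile3, with no empty leading file.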

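A word of warning: this does not work if your pattern spans multiple lines (that is, if it contains "\n"). Consider an mmap-based solution if that is the case.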
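
The answer mentions mmap without showing it, so here is one possible sketch of that idea, not the specific solution the answer refers to. It searches the memory-mapped file for a bytes pattern, which may contain "\n", and copies each chunk out in fixed-size blocks so no large chunk is held in memory at once. The names split_with_mmap and block_size are hypothetical; the sketch assumes a non-empty input file and a 64-bit Python build, since the whole file is mapped into the address space.

```python
import mmap
import os


def split_with_mmap(in_path, out_dir, pat=b'<?xml', block_size=1 << 20):
    """Split in_path before each occurrence of the bytes pattern pat."""
    paths = []
    # Note: mmap raises ValueError for an empty file.
    with open(in_path, 'rb') as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        size = len(mm)
        start = 0
        while start < size:
            # Search strictly after the current start, so a chunk that
            # begins with pat keeps its own pattern.
            nxt = mm.find(pat, start + 1)
            end = size if nxt == -1 else nxt
            path = os.path.join(out_dir, '%d.part' % (len(paths) + 1))
            with open(path, 'wb') as out:
                # Copy the chunk in blocks to bound memory use.
                for pos in range(start, end, block_size):
                    out.write(mm[pos:min(pos + block_size, end)])
            paths.append(path)
            start = end
    return paths
```

Because the search works on raw bytes rather than on lines, a pattern such as b'<a>\n<b>' is found even though it crosses a line boundary. Any text before the first match goes into the first part.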
