Split big file in unix based on size and pattern

Problem description

I have a huge file, 45 GB. I want to split it into 4 parts. I can do this with: split --bytes=12G inputfile.
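For illustration, here is the same command on a tiny throwaway sample (1 kB chunks instead of 12 GB, made-up file names), showing how a purely size-based split cuts wherever the byte count dictates:

```shell
#!/bin/sh
# Size-only split on throwaway sample data: chunk boundaries fall
# wherever the byte count says, even in the middle of a line/record.
yes 'abcdefghij' | head -n 200 > inputfile   # 200 lines of 11 bytes each
split --bytes=1K inputfile                   # GNU split: chunks xaa, xab, ...
wc -c xaa                                    # first chunk is exactly 1024 bytes
```

Since 1024 is not a multiple of the 11-byte line length, the first chunk ends mid-line, which is exactly the problem described below.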

The problem is that this disturbs the structure of the file: split cuts purely by size, so the record format is not preserved. My input file looks like this:

Inspecting sequence ID   chr1:11873-13873

 V$ARID3A_04            |     1981 (-) |  0.899 |  0.774 | tttctatAATAActaaa
 V$ARID3A_04            |     1982 (+) |  0.899 |  0.767 | ttctaTAATAactaaag
Inspecting sequence ID   chr1:11873-13873

 V$ARID3A_04            |     1981 (-) |  0.899 |  0.774 | tttctatAATAActaaa
 V$ARID3A_04            |     1982 (+) |  0.899 |  0.767 | ttctaTAATAactaaag

I want to split the file, but also break at the pattern Inspecting, so that the split files I get back must look like this:

Inspecting sequence ID   chr1:11873-13873

 V$ARID3A_04            |     1981 (-) |  0.899 |  0.774 | tttctatAATAActaaa
 V$ARID3A_04            |     1982 (+) |  0.899 |  0.767 | ttctaTAATAactaaag
 V$ARNT_Q6_01           |      390 (+) |  1.000 |  0.998 | tACGTGgc

And this:

Inspecting sequence ID   chr1:11873-13873

 V$ARID3A_04            |     1981 (-) |  0.899 |  0.774 | tttctatAATAActaaa
 V$ARID3A_04            |     1982 (+) |  0.899 |  0.767 | ttctaTAATAactaaag
 V$ARNT_Q6_01           |      390 (+) |  1.000 |  0.998 | tACGTGgc

NOTE: This pattern matching should be the second preference, while the first should be size. For example, split the file into chunks of about 12 GB, breaking at a match of Inspecting. If I split based only on the pattern Inspecting, I will get thousands of split files, because this pattern repeats again and again.

Recommended answer

sed 来完成它会非常困难,因为到目前为止您没有简单的方法来跟踪读取的字符.使用 awk 会更容易:

Doing it with sed would be pretty difficult, since you have no easy way of keeping track of the characters read so far. It would be easier with awk:

# Start numbering the output files at 1.
BEGIN {
    fileno = 1
}
# Accumulate the number of characters read so far.
{
    size += length()
}
# Once the current chunk exceeds the threshold, switch to a new output
# file, but only at a record header, so records are never cut in half.
size > 100000 && /Inspecting/ {
    fileno++
    size = 0
}
# Every line goes to the current output file (out1, out2, ...).
{
    print $0 > ("out" fileno)
}
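To see the script in action, here is a self-contained run on a small sample, with the 12 GB threshold lowered to 100 characters so the effect is visible (file names chunk.awk, sample.txt, out1 … are illustrative):

```shell
#!/bin/sh
# Demo of the size-then-pattern split. The threshold is tiny (100
# characters) so the behaviour shows on a small sample; file names
# chunk.awk and sample.txt are made up.

cat > chunk.awk <<'EOF'
BEGIN { fileno = 1 }
{ size += length() }
size > 100 && /Inspecting/ { fileno++; size = 0 }
{ print $0 > ("out" fileno) }
EOF

# Build a sample input with several "Inspecting ..." records.
: > sample.txt
for i in 1 2 3 4 5; do
    {
        echo 'Inspecting sequence ID   chr1:11873-13873'
        echo ' V$ARID3A_04 | 1981 (-) | 0.899 | 0.774 | tttctatAATAActaaa'
    } >> sample.txt
done

awk -f chunk.awk sample.txt

# Every chunk starts at a record boundary:
for f in out[0-9]*; do head -n 1 "$f"; done
```

Because the file switch only happens on a line matching Inspecting, each output file begins with a record header, never with a half-record.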

Adjust the size according to your needs. awk might have problems handling very large numbers; for this reason it might be better to keep track of the number of lines read so far instead.
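A line-counting variant of the same script could look like this (a sketch; the threshold is set to 3 lines for the demo, and in practice would be tuned so each chunk comes out near 12 GB):

```shell
#!/bin/sh
# Same idea, but counting lines instead of characters, which avoids
# large-number arithmetic in awk entirely. File names chunk_lines.awk,
# sample2.txt and part1, part2, ... are made up.

cat > chunk_lines.awk <<'EOF'
BEGIN { fileno = 1 }
{ lines++ }
lines > 3 && /Inspecting/ { fileno++; lines = 0 }
{ print $0 > ("part" fileno) }
EOF

printf '%s\n' \
    'Inspecting sequence ID   chr1:11873-13873' \
    ' V$ARID3A_04 | 1981 (-) | 0.899 | 0.774 | tttctatAATAActaaa' \
    'Inspecting sequence ID   chr1:14873-15873' \
    ' V$ARID3A_04 | 1982 (+) | 0.899 | 0.767 | ttctaTAATAactaaag' \
    'Inspecting sequence ID   chr1:16873-17873' \
    ' V$ARNT_Q6_01 | 390 (+) | 1.000 | 0.998 | tACGTGgc' \
    > sample2.txt

awk -f chunk_lines.awk sample2.txt
```

With a 3-line threshold, the first two records (4 lines) land in part1 and the third record starts part2, again on an Inspecting boundary.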
