File processing by awk or grep

Problem description

I have to process a large input file (2.9 GB) to produce output in a specific format (described below).

A sample of the input file:

GS  RSPH14
CC  Build HSA_Jul2014 (GRCh38; hg38): chr22:23141092..23152092 (REVERSE)
FT  TFBS CHIP: FR000000873; SP1 (Jurkat); PMID:14980218; 23144712..23145380
FT  TFBS CHIP: FR000643682; ER-ALPHA (MCF-7); PMID:19339991; 23147445..23148194
FT  TFBS CHIP: FR029934262; C/EBPBETA (A-549); https://www.encodeproject.org/experiments/ENCSR000DYI/; 23150853..23151108
GS  CLXC15
CC  Build HSA_Jul2014 (GRCh38; hg38): chr3:23144021..23155021 (REVERSE)
FT  TFBS CHIP: FR000643682; ER-ALPHA (MCF-7); PMID:19339991; 23147445..23148194
FT  TFBS CHIP: FR034213319; CTCF (MCF-7); https://www.encodeproject.org/experiments/ENCSR000DMV/; 23151393..23151582

Description: Every line in the input file starts with GS, CC, or FT. I want to ignore the GS lines. For each CC line, I want to split it on : and take the element at index 1 (0-based counting); in my sample that is chr22 (line 2) and chr3 (line 7). For each FT line, I want to split it on ; and take the element at index 1 and the last element (for line 3 of my sample these are SP1 (Jurkat) and 23144712..23145380, respectively), then process them so that my output file looks like this:

chr22   23144712    23145380    SP1
chr22   23147445    23148194    ER-ALPHA
chr22   23150853    23151108    C/EBPBETA
chr3    23147445    23148194    ER-ALPHA
chr3    23151393    23151582    CTCF

Any help will be much appreciated!

My attempt: I can split each line on ; to get the columns I need. What I tried is: awk -F'[;]' '{print $2 "\t" $4}' sample.txt > output.txt. This gives me the following output:

 hg38): chr22:23141092..23152092 (REVERSE)  
 SP1 (Jurkat)    23144712..23145380
 ER-ALPHA (MCF-7)    23147445..23148194
 C/EBPBETA (A-549)   23150853..23151108

 hg38): chr3:23144021..23155021 (REVERSE)   
 ER-ALPHA (MCF-7)    23147445..23148194
 CTCF (MCF-7)    23151393..23151582

Now, from the 1st and 6th lines (the ones that originally started with CC) I only want chr22 and chr3. From the other lines (the ones that originally started with FT) I want the last column, with the corresponding chr prepended. The first field of those other lines should also be split on ( so that only the part before it is kept.

Solution

Using awk:

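awk '
    # CC line: the chromosome is the 2nd ":"-separated piece
    $1 == "CC" { split($0, a, /:/); key = a[2] }
    # FT line: the factor name is in the 2nd ";"-separated piece, the coordinates in the last
    $1 == "FT" {
        n = split($0, a, /;/)
        split(a[2], b, FS)          # b[1] = name without the "(cell line)" part
        split(a[n], c, /[.][.]/)    # c[1] = start, c[2] = end; [.][.] also works in awks without {2} support
        print key, c[1], c[2], b[1]
    }
' file | column -t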

chr22  23144712  23145380  SP1
chr22  23147445  23148194  ER-ALPHA
chr22  23150853  23151108  C/EBPBETA
chr3   23147445  23148194  ER-ALPHA
chr3   23151393  23151582  CTCF
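
The pipeline above ends in column -t, which has to read all of its input before it can align the columns; on a 2.9 GB file that may be slow or memory-hungry. As a rough sketch (not part of the original answer), the same logic can write tab-separated output directly from awk instead. The file names sample.txt and output.tsv are placeholders chosen for this example, and the explicit whitespace trimming is an addition that replaces the cleanup column -t was doing implicitly.

```shell
# Build a small test input (placeholder name: sample.txt) from the question's sample.
cat > sample.txt <<'EOF'
GS  RSPH14
CC  Build HSA_Jul2014 (GRCh38; hg38): chr22:23141092..23152092 (REVERSE)
FT  TFBS CHIP: FR000000873; SP1 (Jurkat); PMID:14980218; 23144712..23145380
FT  TFBS CHIP: FR029934262; C/EBPBETA (A-549); https://www.encodeproject.org/experiments/ENCSR000DYI/; 23150853..23151108
GS  CLXC15
CC  Build HSA_Jul2014 (GRCh38; hg38): chr3:23144021..23155021 (REVERSE)
FT  TFBS CHIP: FR034213319; CTCF (MCF-7); https://www.encodeproject.org/experiments/ENCSR000DMV/; 23151393..23151582
EOF

# Same approach as the answer, but with tab as the output separator and no column -t.
awk -v OFS='\t' '
    $1 == "CC" {
        split($0, a, /:/)
        key = a[2]
        gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim the space left after ":"
    }
    $1 == "FT" {
        n = split($0, a, /;/)
        split(a[2], b, " ")                # b[1] = factor name, e.g. SP1
        split(a[n], c, /[.][.]/)           # c[1] = start, c[2] = end
        gsub(/[ \t]/, "", c[1]); gsub(/[ \t]/, "", c[2])
        print key, c[1], c[2], b[1]
    }
' sample.txt > output.tsv
```

Because awk handles one line at a time and only remembers the current chromosome in key, memory use should stay flat regardless of file size. GS lines match neither pattern, so they are skipped without any extra rule.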
