File processing with awk or grep
Problem description
I have to process a large input file (2.9 GB) to produce output in a specific format (described below).
A sample of the input file:
GS RSPH14
CC Build HSA_Jul2014 (GRCh38; hg38): chr22:23141092..23152092 (REVERSE)
FT TFBS CHIP: FR000000873; SP1 (Jurkat); PMID:14980218; 23144712..23145380
FT TFBS CHIP: FR000643682; ER-ALPHA (MCF-7); PMID:19339991; 23147445..23148194
FT TFBS CHIP: FR029934262; C/EBPBETA (A-549); https://www.encodeproject.org/experiments/ENCSR000DYI/; 23150853..23151108
GS CLXC15
CC Build HSA_Jul2014 (GRCh38; hg38): chr3:23144021..23155021 (REVERSE)
FT TFBS CHIP: FR000643682; ER-ALPHA (MCF-7); PMID:19339991; 23147445..23148194
FT TFBS CHIP: FR034213319; CTCF (MCF-7); https://www.encodeproject.org/experiments/ENCSR000DMV/; 23151393..23151582
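Description: Every line in the input file starts with GS, CC, or FT. I want to ignore the GS lines. For each CC line, I want to split it on ":" and take the element at index 1 (0-based counting); in my input sample that is chr22 (line 2) and chr3 (line 7). For each FT line, I want to split it on ";" and take the element at index 1 and the last element (for line 3 of my input sample these are "SP1 (Jurkat)" and "23144712..23145380", respectively), then process them so that my output file looks like this: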
chr22 23144712 23145380 SP1
chr22 23147445 23148194 ER-ALPHA
chr22 23150853 23151108 C/EBPBETA
chr3 23147445 23148194 ER-ALPHA
chr3 23151393 23151582 CTCF
Any help will be much appreciated!
My attempt:
I am able to split the file on ";" so that I get the columns I need. What I tried is:
awk -F'[;]' '{print $2 "\t" $4}' sample.txt > output.txt
This gives me the following output:
hg38): chr22:23141092..23152092 (REVERSE)
SP1 (Jurkat) 23144712..23145380
ER-ALPHA (MCF-7) 23147445..23148194
C/EBPBETA (A-549) 23150853..23151108
hg38): chr3:23144021..23155021 (REVERSE)
ER-ALPHA (MCF-7) 23147445..23148194
CTCF (MCF-7) 23151393..23151582
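Now, from the 1st and 6th lines I only want chr22 and chr3, and from the other lines (those other than the 1st and 6th, which originally started with GS or CC) I only want the last column, with the corresponding chr prepended. The 1st index of the other lines should also be split on "(" so that only its first element is kept.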
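Solution
Using awk:
awk '
# CC line: the chromosome name is the text between the first two colons
$1 == "CC" { split($0, a, /:/); key = a[2] }
$1 == "FT" {
    n = split($0, a, /;/)        # a[2] holds the factor, a[n] holds the range
    split(a[2], b, FS)           # b[1] is the factor name before "("
    split(a[n], c, /[.]{2}/)     # c[1] and c[2] are the start and end positions
    print key, c[1], c[2], b[1]
}
' file | column -t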
chr22 23144712 23145380 SP1
chr22 23147445 23148194 ER-ALPHA
chr22 23150853 23151108 C/EBPBETA
chr3 23147445 23148194 ER-ALPHA
chr3 23151393 23151582 CTCF
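For a file of this size, a variant that writes tab-separated fields directly may be worth considering, since column -t has to read all of its input before it can work out the column widths. The following is a minimal sketch along those lines and is not part of the original answer. The temporary sample file, the variable names sample and result, the whitespace trimming, and the [.][.] pattern are illustrative assumptions. The [.][.] pattern is used in place of [.]{2} because some awk implementations, such as older mawk releases, do not support interval expressions.

```shell
# Small sample file mirroring the input in the question (illustrative only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
GS RSPH14
CC Build HSA_Jul2014 (GRCh38; hg38): chr22:23141092..23152092 (REVERSE)
FT TFBS CHIP: FR000000873; SP1 (Jurkat); PMID:14980218; 23144712..23145380
FT TFBS CHIP: FR000643682; ER-ALPHA (MCF-7); PMID:19339991; 23147445..23148194
FT TFBS CHIP: FR029934262; C/EBPBETA (A-549); https://www.encodeproject.org/experiments/ENCSR000DYI/; 23150853..23151108
GS CLXC15
CC Build HSA_Jul2014 (GRCh38; hg38): chr3:23144021..23155021 (REVERSE)
FT TFBS CHIP: FR000643682; ER-ALPHA (MCF-7); PMID:19339991; 23147445..23148194
FT TFBS CHIP: FR034213319; CTCF (MCF-7); https://www.encodeproject.org/experiments/ENCSR000DMV/; 23151393..23151582
EOF

# OFS is set to a tab so that no separate alignment step is needed.
result=$(awk -v OFS='\t' '
$1 == "CC" {
    split($0, a, /:/)                     # a[2] is the text between the first two colons
    key = a[2]
    gsub(/^[ \t]+|[ \t]+$/, "", key)      # trim blanks around the chromosome name
}
$1 == "FT" {
    n = split($0, a, /;/)                 # a[2] is the factor, a[n] is the range
    split(a[2], b, " ")                   # b[1] is the factor name before the cell line
    split(a[n], c, /[.][.]/)              # c[1] is the start, c[2] is the end
    gsub(/[ \t\r]/, "", c[1])             # strip stray blanks and carriage returns
    gsub(/[ \t\r]/, "", c[2])
    print key, c[1], c[2], b[1]
}
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

On the real input, the same awk program could be pointed at the file and redirected to output.txt. Because awk handles one line at a time, memory use should stay roughly constant regardless of file size.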