AWK work with VCF (text) file
Problem description
I would like to write awk code that modifies a text file like this:
- Make all columns tab-delimited
- Delete all lines that start with "##text"
- Keep the header line, which starts with "#header"
I have this code, but it does not work:
#!/bin/bash
for i in *.vcf; do
    awk 'BEGIN {print "CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILT\tINFO\tFORMAT"}' |
    awk '{$1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7 "\t" $8 "\t" $9}' $i |
    awk '!/#/' > ${i%.vcf}.tsv;
done
INPUT:
> ##fileformat=VCFv4.1
> ##FORMAT=<ID=GQX,Number=1,Type=Integer,Description="Minimum of {Genotype quality assuming variant position,Genotype quality assuming non-variant position}">
> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1
> chr1 10385471 rs17401966 A G 100.00 PASS DP=67;TI=NM_015074;GI=KIF1B;FC=Silent GT:GQ:AD:VF:NL:SB:GQX 0/1:100:29,38:0.5672:20:-100.0000:100
> chr1 17380497 rs2746462 G T 100.00 PASS DP=107;TI=NM_003000;GI=SDHB;FC=Synonymous_A6A;EXON GT:GQ:AD:VF:NL:SB:GQX 1/1:100:0,107:1.0000:20:-100.0000:100
> chr1 222045446 rs6691170 G T 100.00 PASS DP=99 GT:GQ:AD:VF:NL:SB:GQX 0/1:100:49,50:0.5051:20:-100.0000:100
OUTPUT: What I want
> CHROM POS ID REF ALT QUAL FILTER INFO etc...
> chr1 10385471 rs17401966 A G 100.00 PASS DP=67;TI=NM_015074;GI=KIF1B;
You want to put your whole program in a single awk call:
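for f in *.vcf; do
    awk '
        BEGIN { OFS = "\t" }           # join output fields with tabs
        /^##/ { next }                 # skip meta-information lines
        /^#/  { sub(/^#/, "", $1) }    # strip the leading # from the header line
        { $1 = $1; print }             # rebuild the record with OFS and print it
    ' "$f" > "${f/%vcf/tsv}"
done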
This program will skip any record that begins with ##, will remove the leading hash for lines that have it, and then print each line using tab as the field separator.
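To check that behaviour on a small sample, the sketch below wraps the same awk program in a function and runs it on a three-line, space-separated file in a temporary directory. The function name vcf_to_tsv, the file name sample.vcf and the shortened sample records are assumptions made for illustration, not part of the original answer.

```shell
#!/bin/bash
# Hypothetical helper: the awk program from the answer, reading the
# files given as arguments (or standard input when there are none).
vcf_to_tsv() {
    awk '
        BEGIN { OFS = "\t" }
        /^##/ { next }
        /^#/  { sub(/^#/, "", $1) }
        { $1 = $1; print }
    ' "$@"
}

# Build a tiny sample file (made-up, shortened records).
tmpdir=$(mktemp -d)
printf '%s\n' \
    '##fileformat=VCFv4.1' \
    '#CHROM POS ID REF ALT' \
    'chr1 10385471 rs17401966 A G' > "$tmpdir/sample.vcf"

vcf_to_tsv "$tmpdir/sample.vcf" > "$tmpdir/sample.tsv"
cat "$tmpdir/sample.tsv"
```

The resulting sample.tsv should contain the header without its leading hash and the data line, both tab-separated, with the ## line gone.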
awk programs are a series of condition {action} pairs. For each record in the input, if the condition is true, the action block is performed, otherwise it is ignored. If the condition is omitted, the action block is performed unconditionally.
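As a minimal sketch of that rule structure, the snippet below uses one rule with a condition, one rule without, and an END rule that runs after the last record. The function name demo_pairs and the three input records are made up for this illustration.

```shell
#!/bin/bash
# Hypothetical demo: one conditional rule, one unconditional rule.
demo_pairs() {
    awk '
        $2 > 10 { print "big:", $1 }       # runs only when the condition holds
        { count++ }                        # no condition: runs for every record
        END { print "records:", count }    # runs once, after all input
    '
}

printf '%s\n' 'a 1' 'b 20' 'c 3' | demo_pairs
```

Only the record "b 20" satisfies the first condition, while the unconditional rule counts all three records.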
One tricky bit in this example is $1=$1: assigning to a field, even with its own value, makes awk rebuild the record, joining the fields with the output field separator (the OFS variable).
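The effect is easy to see side by side. In this small sketch (the input line and the variable names are made up), the same line is printed once without touching any field and once after the $1=$1 assignment; tabs are shown as | so they are visible.

```shell
#!/bin/bash
line='a b   c'

# No field is assigned, so the record is printed exactly as it was read.
unchanged=$(echo "$line" | awk 'BEGIN { OFS = "\t" } { print }')

# Assigning to a field forces awk to rebuild the record using OFS.
rebuilt=$(echo "$line" | awk 'BEGIN { OFS = "\t" } { $1 = $1; print }')

# Show tabs as | to make the difference visible.
printf '%s\n' "$unchanged" "$rebuilt" | tr '\t' '|'
```

Without the assignment, setting OFS alone changes nothing, because print with no arguments outputs the record as it currently stands.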