Working with a VCF (text) file in AWK


Problem description


I would like to write awk code that modifies text like this:

  1. Make all columns tab-delimited
  2. Delete all lines that start with "##text"
  3. Keep the header line, which starts with "#header"

I have this code, but it does not work properly:

#!/bin/bash
for i in *.vcf;
do
    awk 'BEGIN {print  "CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILT\tINFO\tFORMAT"}' |
    awk '{$1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7 "\t" $8 "\t" $9}' $i |
    awk '!/#/' > ${i%.vcf}.tsv;
done

INPUT:

> ##fileformat=VCFv4.1
> ##FORMAT=<ID=GQX,Number=1,Type=Integer,Description="Minimum of {Genotype quality assuming variant position,Genotype quality assuming non-variant position}">
> #CHROM    POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  1
> chr1  10385471    rs17401966  A   G   100.00  PASS    DP=67;TI=NM_015074;GI=KIF1B;FC=Silent   GT:GQ:AD:VF:NL:SB:GQX   0/1:100:29,38:0.5672:20:-100.0000:100
> chr1  17380497    rs2746462   G   T   100.00  PASS    DP=107;TI=NM_003000;GI=SDHB;FC=Synonymous_A6A;EXON  GT:GQ:AD:VF:NL:SB:GQX   1/1:100:0,107:1.0000:20:-100.0000:100
> chr1  222045446   rs6691170   G   T   100.00  PASS    DP=99   GT:GQ:AD:VF:NL:SB:GQX   0/1:100:49,50:0.5051:20:-100.0000:100

OUTPUT: What I want

> CHROM POS   ID          REF  ALT  QUAL    FILTER  INFO             etc...
> chr1  10385471  rs17401966  A   G 100.00  PASS    DP=67;TI=NM_015074;GI=KIF1B;

Solution

You want to put your whole program in a single awk call:

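for f in *.vcf; do
    awk '
        BEGIN {OFS = "\t"}          # join output fields with tabs
        /^##/ {next}                # skip meta-information lines
        /^#/ {sub(/^#/,"",$1)}      # strip the leading # from the header line
        {$1=$1; print}              # rebuild the record with OFS, then print
    ' "$f" > "${f/%vcf/tsv}"        # bash: replace the trailing "vcf" with "tsv"
done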

This program will skip any record that begins with ##, will remove the leading hash for lines that have it, and then print each line using tab as the field separator.
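
To see that behaviour end to end, here is a small sketch that runs the same awk program on a made-up file. The scratch directory, the name `sample.vcf`, and its three-column content are assumptions for illustration only, and the output name uses `${f%.vcf}.tsv` (as in the question) so the sketch does not depend on bash.

```shell
# Work in a scratch directory so no real files are touched.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Hypothetical sample: one meta line, one header line, one data record.
printf '%s\n' '##fileformat=VCFv4.1' '#CHROM POS ID' 'chr1 10385471 rs17401966' > sample.vcf

for f in *.vcf; do
    awk '
        BEGIN {OFS = "\t"}
        /^##/ {next}
        /^#/ {sub(/^#/, "", $1)}
        {$1 = $1; print}
    ' "$f" > "${f%.vcf}.tsv"
done

# Show the converted file: the ## line is gone and fields are tab-separated.
cat sample.tsv
```

The `##fileformat` line should be absent from `sample.tsv`, the header should start with `CHROM` rather than `#CHROM`, and every field should be separated by a single tab.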

awk programs are a series of condition {action} pairs. For each record in the input, if the condition is true, the action block is performed, otherwise it is ignored. If the condition is omitted, the action block is performed unconditionally.
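
As a rough illustration of how those pairs are evaluated, the sketch below feeds three invented records through a program with two conditional blocks and one unconditional block. The input strings and the `header:` / `seen:` labels are hypothetical and only there to make the control flow visible.

```shell
# Three hypothetical input records: a meta line, a header line, a data line.
input=$(printf '%s\n' '##meta' '#header' 'data')

result=$(printf '%s\n' "$input" | awk '
    /^##/ {next}                 # condition true: skip the record entirely
    /^#/  {print "header: " $0}  # runs only for records starting with #
          {print "seen: " $0}    # no condition: runs for every remaining record
')

printf '%s\n' "$result"
```

The `##meta` record never reaches the later blocks because `next` moves on to the next record, while `#header` matches the second pattern and then also falls through to the unconditional block.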

One tricky bit in this example is $1=$1 -- when fields are modified, awk will re-build the record, joining the fields using the output field separator (OFS variable).
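
The effect is easiest to see side by side. In this minimal sketch (the sample line is an assumption, loosely based on the input above), the first awk call prints the record untouched and the second assigns `$1` to itself before printing.

```shell
# A whitespace-separated line with irregular spacing between fields.
line='chr1   10385471    rs17401966  A   G'

# No field is modified, so print emits $0 with its original spacing.
unchanged=$(printf '%s\n' "$line" | awk 'BEGIN {OFS = "\t"} {print}')

# Assigning $1 to itself makes awk rebuild $0 with OFS between the fields.
rebuilt=$(printf '%s\n' "$line" | awk 'BEGIN {OFS = "\t"} {$1 = $1; print}')

printf '%s\n' "$unchanged"
printf '%s\n' "$rebuilt"
```

Setting `OFS` alone changes nothing in the first call, because awk only uses it when it rebuilds the record or when `print` is given several comma-separated arguments.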
