awk to add text to file based on coordinates


Question

In the awk below (which executes but produces empty output), I use the $4 in file1 as a unique id and read each $1, $2, and $3 value into the variables chr, min, and max.

The $4 in file2 is then split on the _ and read into array. Each value in the split will match a $4 id in file1. The chr needs to match the $1, and the min and max must be between the $2 and $3 values in file2.

An exact match is not needed; it is enough that the min or max variable falls within $2 and $3. If that is true, exon is printed in $5 of file1; if it is not, intron is printed in $5.

The desired output has the exon/intron added to it, but there is another part where the values in $2 or $3 are adjusted; I am trying to script that myself before I ask. I am not sure if the below is the best way, but hopefully it is a start. Thank you :).

file1 (tab delimited)

chr7    94027591    94027701    COL1A2
chr6    31980068    31980074    TNXB

file2 (tab delimited)

chr7    94027059    94027070    COL1A2_cds_1_0_chr7_94027060_f  0   +
chr7    94027693    94027708    COL1A2_cds_2_0_chr7_94027694_f  0   +
chr6    32009125    32009227    TNXB_cds_0_0_chr6_32009126_r    0   -
chr6    32009547    32009711    TNXB_cds_1_0_chr6_32009548_r    0   -

Desired output

chr7    94027683    94027701    COL1A2    exon
chr6    31980068    31980074    TNXB    intron

awk with comments

awk '
FNR==NR{  # open block; true only while reading the first file (file1)
 a[$4];  # use as a key with unique id
 chr[$4]=$1;  # store $1 value in chr
 min[$4]=$2;  # store $2 value in min
 max[$4]=$3;  # store $3 value in max
  next  # process next line
}  # close block
{  # open block
 split($4,array,"_");  # split $4 on underscore
 print $0,(array[1] in a) &&  ($2<=min[array[1]] && $2<=max[array[1]] &&  $1=chr[array[1]])?"exon":"intron"
}' file1 OFS="\t" file2 > output  # close block, mention input with field separators and output

Recommended answer

IMHO, the final output you show does not look logically correct, since Input_file2 has multiple entries while Input_file1 has only single ones (I am going by the samples shown only). Could you please check this once? If there are any changes in your output or logic, please mention them clearly.

awk '
BEGIN{
  SUBSEP=","
}
FNR==NR{
  max[$1,$NF]=$3
  min[$1,$NF]=$2
  next
}
{
  split($4,array,"_")
}
(($1,array[1]) in max){
  if(($2>min[array[5],array[1]] && $2<max[array[5],array[1]]) || ($3>min[array[5],array[1]] && $3<max[array[5],array[1]])){
     print array[5],array[1],min[array[5],array[1]],max[array[5],array[1]],"exon"
     next
  }
}
{
  print $0,"intron"
}'  Input_file1   Input_file2  | column -t

This command checks whether Input_file2's 2nd field or 3rd field falls within the range of Input_file1's 2nd and 3rd fields. If either of them does, it prints the matching Input_file1 values with exon added; otherwise it prints the Input_file2 line with the intron string appended at the end.
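For comparison, here is a minimal sketch of the file1-oriented variant the question describes: one output line per file1 region, labelled exon when its start or end falls inside any file2 interval with the same chromosome and gene id. It is not part of the original answer. The temporary directory, the array names n, lo, and hi, the label variable, and the choice of inclusive bounds are all assumptions made for illustration, and it does not attempt the coordinate adjustment mentioned in the question.

```shell
#!/bin/sh
# Hypothetical sample files that reproduce the data shown in the question.
tmpdir=$(mktemp -d)
printf 'chr7\t94027591\t94027701\tCOL1A2\nchr6\t31980068\t31980074\tTNXB\n' > "$tmpdir/file1"
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr7 94027059 94027070 COL1A2_cds_1_0_chr7_94027060_f 0 + \
  chr7 94027693 94027708 COL1A2_cds_2_0_chr7_94027694_f 0 + \
  chr6 32009125 32009227 TNXB_cds_0_0_chr6_32009126_r 0 - \
  chr6 32009547 32009711 TNXB_cds_1_0_chr6_32009548_r 0 - > "$tmpdir/file2"

# file2 is read first so that every interval is known before file1 is labelled.
result=$(awk -F'\t' -v OFS='\t' '
FNR == NR {                       # first file on the command line (file2)
  split($4, parts, "_")           # parts[1] is the gene id, e.g. COL1A2
  key = $1 SUBSEP parts[1]        # chromosome + gene id
  n[key]++                        # number of intervals stored for this key
  lo[key, n[key]] = $2 + 0        # interval start, forced numeric
  hi[key, n[key]] = $3 + 0        # interval end, forced numeric
  next
}
{                                 # second file (file1): one label per region
  key = $1 SUBSEP $4
  label = "intron"
  for (i = 1; i <= n[key]; i++)
    # Assumption: bounds are inclusive; only the two endpoints are tested.
    if (($2 >= lo[key, i] && $2 <= hi[key, i]) ||
        ($3 >= lo[key, i] && $3 <= hi[key, i])) {
      label = "exon"
      break
    }
  print $0, label
}' "$tmpdir/file2" "$tmpdir/file1")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

On the sample data this labels the COL1A2 region exon (its end, 94027701, lies inside 94027693-94027708) and the TNXB region intron. Because only the endpoints are tested, as the question specifies, a file1 region that completely contains a file2 interval would still be labelled intron; a full overlap test would need a different condition.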
