awk to add text to file based on coordinates
Problem description
In the awk below (which executes but produces an empty output), I use $4 in file1 as a unique id and read each $1, $2, and $3 value into the variables chr, min, and max.
The $4 in file2 is then split on the _ and read into array. The first value in the split (array[1]) matches a $4 id in file1. The chr needs to match the $1, and the min and max must be between the $2 and $3 values in file2.
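To make the split step concrete, here is a minimal sketch using one file2 row from the sample data. The helper name gene_id and the array name parts are hypothetical, and tab-separated input is assumed.

```shell
# Hypothetical helper: print the first "_"-separated piece of column 4
# together with the number of pieces that split() produced.
gene_id() {
  awk -F'\t' '{ n = split($4, parts, "_"); print parts[1], n }'
}

# One row of file2 from the question (tab-delimited).
printf 'chr7\t94027059\t94027070\tCOL1A2_cds_1_0_chr7_94027060_f\t0\t+\n' | gene_id
# prints: COL1A2 7
```

Only parts[1] is needed for the lookup against file1; the chromosome also appears as parts[5], but $1 already carries it.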
An exact match is not needed; it is enough that the min or max variable falls within $2 and $3. If that is true, exon is printed in $5 of file1; if it is not, intron is printed in $5.
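The rule can be written as a small standalone check. This is only a sketch: the function name overlaps is hypothetical, and inclusive bounds are an assumption, since the question does not say whether the endpoints count.

```shell
# overlaps MIN MAX START END
# MIN/MAX come from file1 ($2, $3); START/END come from file2 ($2, $3).
# Assumption: the bounds are inclusive.
overlaps() {
  awk -v min="$1" -v max="$2" -v s="$3" -v e="$4" 'BEGIN {
    # +0 forces numeric rather than string comparison
    hit = (min+0 >= s+0 && min+0 <= e+0) || (max+0 >= s+0 && max+0 <= e+0)
    print (hit ? "exon" : "intron")
  }'
}

overlaps 94027591 94027701 94027693 94027708   # prints exon (max is inside)
overlaps 31980068 31980074 32009125 32009227   # prints intron
```

Note that the test is an OR of two range checks, one for min and one for max, rather than a single comparison against both bounds.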
The desired output has exon/intron added to it. There is another part where the values in $2 or $3 are adjusted, but I am trying to script that myself before I ask. I am not sure if the below is the best way, but hopefully it is a start. Thank you :).
file1 (tab-delimited)
chr7 94027591 94027701 COL1A2
chr6 31980068 31980074 TNXB
file2 (tab-delimited)
chr7 94027059 94027070 COL1A2_cds_1_0_chr7_94027060_f 0 +
chr7 94027693 94027708 COL1A2_cds_2_0_chr7_94027694_f 0 +
chr6 32009125 32009227 TNXB_cds_0_0_chr6_32009126_r 0 -
chr6 32009547 32009711 TNXB_cds_1_0_chr6_32009548_r 0 -
Desired output
chr7 94027683 94027701 COL1A2 exon
chr6 31980068 31980074 TNXB intron
awk with comments
awk '
FNR==NR{ # open block, process lines of file1 (the first file) only
a[$4]; # use as a key with unique id
chr[$4]=$1; # store $1 value in chr
min[$4]=$2; # store $2 value in min
max[$4]=$3; # store $3 value in max
next # process next line
} # close block
{ # open block
split($4,array,"_"); # split $4 on underscore
print $0,((array[1] in a) && ($2<=min[array[1]] && $2<=max[array[1]] && $1==chr[array[1]])?"exon":"intron")
}' file1 OFS="\t" file2 > output # close block, mention input with field separators and output
Recommended answer
IMHO, the final output you show does not look logically correct: Input_file2 has multiple entries per id, while Input_file1 has only a single one (I am going by the samples shown only). Could you please check this once? If anything in your output or logic changes, please state it clearly.
awk '
BEGIN{
SUBSEP=","
}
FNR==NR{
max[$1,$NF]=$3
min[$1,$NF]=$2
next
}
{
split($4,array,"_")
}
(($1,array[1]) in max){
if(($2>min[array[5],array[1]] && $2<max[array[5],array[1]]) || ($3>min[array[5],array[1]] && $3<max[array[5],array[1]])){
print array[5],array[1],min[array[5],array[1]],max[array[5],array[1]],"exon"
next
}
}
{
print $0,"intron"
}' Input_file1 Input_file2 | column -t
This command checks whether Input_file2's 2nd field OR 3rd field falls within the range given by Input_file1's 2nd and 3rd fields. If either one does, it prints the Input_file1 values with exon appended; otherwise it prints the Input_file2 line with the string intron appended at the end.
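If the output should instead follow file1's layout, with one row per file1 record as in the desired output, one possible variant is sketched below. It is not the command above: it reads file2 first and keeps every interval per id, then labels each file1 row using the question's rule (min or max inside a file2 interval on the same chromosome). The function name annotate and the array names n, chrom, lo, and hi are hypothetical, the bounds are assumed to be inclusive, and the coordinate adjustment seen in the desired output is not attempted.

```shell
# annotate FILE1 FILE2
# Prints each FILE1 row with "exon" or "intron" appended as column 5.
annotate() {
  awk -F'\t' -v OFS='\t' '
    # First pass (file2): remember every interval, grouped by gene id.
    FNR == NR {
      split($4, parts, "_")
      g = parts[1]
      n[g]++
      chrom[g, n[g]] = $1
      lo[g, n[g]] = $2 + 0
      hi[g, n[g]] = $3 + 0
      next
    }
    # Second pass (file1): exon if $2 or $3 lies inside any stored
    # interval for the same gene id on the same chromosome.
    {
      label = "intron"
      for (i = 1; i <= n[$4]; i++)
        if ($1 == chrom[$4, i] &&
            (($2 >= lo[$4, i] && $2 <= hi[$4, i]) ||
             ($3 >= lo[$4, i] && $3 <= hi[$4, i]))) {
          label = "exon"
          break
        }
      print $1, $2, $3, $4, label
    }
  ' "$2" "$1"
}

# Usage (file names as in the question):
#   annotate file1 file2 > output
```

Storing all intervals per id handles the point that file2 has several entries for each id in file1. The sketch assumes file2 is not empty, because the FNR == NR test cannot tell the two files apart otherwise.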