awk to find overlaps
Question
I have a file with columns as shown below.
Group Start End
chr1 117132092 118875009
chr1 117027758 119458215
chr1 103756473 104864582
chr1 105093795 106219211
chr1 103354114 104747251
chr1 102741437 105235140
chr1 100090254 101094139
chr1 100426977 101614730
chr2 86644663 87767193
chr2 82473711 83636545
chr2 83896702 85079032
chr2 83876122 85091910
chr2 82943211 84350917
chr3 89410051 90485635
chr3 89405753 90485635
chr3 86491492 87593215
chr3 82507157 83738004
chr3 85059618 86362254
I would like to find the overlaps between those coordinates within each group (grouped by chr1, chr2, chr3, ...).
Each row's start and end coordinates have to be checked for at least 50% overlap with the other rows in the same group. If there is at least 50% overlap, the new start and end coordinates (the range spanned by the overlapping intervals) have to be reported in two new columns, NewStart and NewEnd. If a row does not overlap with any other, its original start and end have to be reported in those columns instead.
To make it clearer, let's take the first two rows:
117132092..........118875009
117027758...........................119458215
Since they overlap each other by at least 50%, their range is reported as the new start and new end in the output. Rows 3 and 4 do not overlap with the others, so their original coordinates are reported as the new start and new end. Rows 5 and 6 again overlap each other by at least 50%, so their range is reported as the new start and new end. Here is the expected output:
Group Start End NewStart NewEnd
chr1 117132092 118875009 117027758 119458215
chr1 117027758 119458215 117027758 119458215
chr1 103756473 104864582 103354114 104864582
chr1 105093795 106219211 105093795 106219211
chr1 103354114 104747251 102741437 105235140
chr1 102741437 105235140 102741437 105235140
chr1 100090254 101094139 100090254 101614730
chr1 100426977 101614730 100090254 101614730
chr2 86644663 87767193 86644663 87767193
chr2 82473711 83636545 82473711 83636545
chr2 83896702 85079032 83876122 85091910
chr2 83876122 85091910 83876122 85091910
chr2 82943211 84350917 82943211 84350917
chr3 89410051 90485635 89405753 90485635
chr3 89405753 90485635 89405753 90485635
chr3 86491492 87593215 86491492 87593215
chr3 82507157 83738004 82507157 83738004
chr3 85059618 86362254 85059618 86362254
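The 50% criterion described above can be checked by hand for any pair of rows. The following is only a minimal sketch, assuming that an interval's length is End minus Start and that the overlap is measured against each of the two intervals separately; `overlap_pct` is a hypothetical helper name, not part of the question.

```shell
# Hypothetical helper: print the percentage of each interval that is
# covered by the intersection of the two intervals.
# Usage: overlap_pct START1 END1 START2 END2
overlap_pct() {
  awk -v s1="$1" -v e1="$2" -v s2="$3" -v e2="$4" 'BEGIN {
    s1 += 0; e1 += 0; s2 += 0; e2 += 0   # force numeric comparison
    lo = (s1 > s2) ? s1 : s2             # start of the intersection
    hi = (e1 < e2) ? e1 : e2             # end of the intersection
    len = (hi > lo) ? hi - lo : 0        # 0 when the intervals are disjoint
    printf "%.1f %.1f\n", 100 * len / (e1 - s1), 100 * len / (e2 - s2)
  }'
}

# First two rows of the sample data
overlap_pct 117132092 118875009 117027758 119458215
```

For the first two rows this prints `100.0 71.7`: the first interval lies entirely inside the second, and it covers about 72% of the second, so both exceed 50%.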
I have achieved this in R, but the original file is huge and it takes a very long time to run. Could someone help me do this in awk?
Recommended answer
Using GNU Awk version 4, you could try:
gawk -f a.awk file file
where a.awk is:
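# Pass 1 (first copy of the file): store the coordinates of every row per group.
NR==FNR {
    if (FNR>1) {                # skip the header line
        a[$1][++i]=$2           # a[group][index] = start
        b[$1][i]=$3             # b[group][index] = end
    }
    next
}
# Pass 2 (second copy of the file): print the header once.
FNR==1 {
    fmt="%-7s%-10s%-10s%-10s%-10s\n"
    printf fmt,"Group","Start","End","NewStart","NewEnd"
}
# Pass 2: compare each row with the candidate rows of its group.
FNR>1 {
    $4=$2; $5=$3                # default: report the original coordinates
    n=checkInside($1,$2,$3)     # candidate indices are left in the global R
    if (n>0) {
        ff=0; x=$2; y=$3
        for (i=1; i<=n; i++) {
            ar=a[$1][R[i]]; br=b[$1][R[i]];
            getIntersect($2,$3,ar,br)
            getLargest($2,$3,ar,br)
            ovl=((i2-i1)/($3-$2))*100;   # % of this row covered by the intersection
            ovr=((i2-i1)/(br-ar))*100;   # % of the other row covered
            if (ovl>50 && ovr>50) {
                if (r1<x) x=r1           # extend the reported range
                if (r2>y) y=r2
                ff=1
            }
        }
        if (ff) {
            $4=x; $5=y
        }
    }
    printf fmt,$1,$2,$3,$4,$5
}
# Set the globals r1,r2 to the combined span of the two intervals.
function getLargest(x1,y1,x2,y2) {
    r1=(x1<=x2)?x1:x2
    r2=(y1>=y2)?y1:y2
}
# Set the globals i1,i2 to the intersection of two overlapping intervals.
function getIntersect(x1,y1,x2,y2) {
    if (x1>=x2 && x1<=y2) {
        i1=x1;
    } else {
        i1=x2;
    }
    i2=(y1<=y2)?y1:y2
}
# Collect in R the indices of the intervals in group g that contain x or y,
# skipping intervals identical to (x,y); return how many were found.
function checkInside(g,x,y,i,j,x1,y1) {
    R["x"]=0                    # make sure R is treated as an array
    for (i in a[g]) {
        x1=a[g][i]; y1=b[g][i];
        if ((x>=x1 && x<=y1) || (y>=x1 && y<=y1)) {
            if (!(x==x1 && y==y1))
                R[++j]=i
        }
    }
    return j
}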
Output:
Group Start End NewStart NewEnd
chr1 117132092 118875009 117027758 119458215
chr1 117027758 119458215 117027758 119458215
chr1 103756473 104864582 103354114 104864582
chr1 105093795 106219211 105093795 106219211
chr1 103354114 104747251 102741437 105235140
chr1 102741437 105235140 102741437 105235140
chr1 100090254 101094139 100090254 101614730
chr1 100426977 101614730 100090254 101614730
chr2 86644663 87767193 86644663 87767193
chr2 82473711 83636545 82473711 83636545
chr2 83896702 85079032 83876122 85091910
chr2 83876122 85091910 83876122 85091910
chr2 82943211 84350917 82943211 84350917
chr3 89410051 90485635 89405753 90485635
chr3 89405753 90485635 89405753 90485635
chr3 86491492 87593215 86491492 87593215
chr3 82507157 83738004 82507157 83738004
chr3 85059618 86362254 85059618 86362254
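The command above names the input file twice because the script reads it in two passes: while `NR==FNR` holds, awk is still reading the first copy and only collects data, and the second copy is then processed using what was collected. A stripped-down sketch of that idiom follows; it assumes the same whitespace-separated layout with one header line, and `count_per_group` is a hypothetical name used only for this illustration. It needs nothing beyond POSIX awk.

```shell
# Hypothetical demo of the two-pass idiom: pass 1 counts the rows of each
# group, pass 2 prints every data row followed by the size of its group.
count_per_group() {
  awk '
    NR == FNR { if (FNR > 1) n[$1]++; next }   # pass 1: first copy of the file
    FNR > 1   { print $1, $2, $3, n[$1] }      # pass 2: second copy of the file
  ' "$1" "$1"
}

# Usage: count_per_group file
```

Reading the file twice avoids holding the full rows in memory before printing, at the cost of a second scan; only the per-group data gathered in the first pass is kept.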