awk to find overlaps

Problem description

I have a file with columns as shown below.

Group   Start        End
chr1    117132092    118875009
chr1    117027758    119458215
chr1    103756473    104864582
chr1    105093795    106219211
chr1    103354114    104747251
chr1    102741437    105235140
chr1    100090254    101094139
chr1    100426977    101614730
chr2    86644663     87767193
chr2    82473711     83636545
chr2    83896702     85079032
chr2    83876122     85091910
chr2    82943211     84350917
chr3    89410051     90485635
chr3    89405753     90485635
chr3    86491492     87593215
chr3    82507157     83738004
chr3    85059618     86362254

I would like to find the overlaps between those coordinates within each group (grouped by chr1, chr2, chr3, ...).

Each row's start and end coordinates have to be checked for at least 50% overlap with the other rows in the same group. If there is at least 50% overlap, the new start and end coordinates (the range spanned by the overlapping intervals) have to be reported in columns 4 and 5 (NewStart and NewEnd). If a row does not overlap with any other, its original start and end have to be reported in columns 4 and 5.

To make it clearer, let's take the first two rows:

                 117132092..........118875009
         117027758...........................119458215

Since both of them overlap at least 50% with each other, the range they span together is reported as the new start and new end in the output. Rows 3 and 4 do not overlap with the others, so their original coordinates are reported as the new start and new end in columns 4 and 5. Again, since rows 5 and 6 overlap each other by at least 50%, their range is reported as the new start and new end in columns 4 and 5. Here is the expected output:

Group  Start      End        NewStart   NewEnd
chr1   117132092  118875009  117027758  119458215
chr1   117027758  119458215  117027758  119458215
chr1   103756473  104864582  103354114  104864582
chr1   105093795  106219211  105093795  106219211
chr1   103354114  104747251  102741437  105235140
chr1   102741437  105235140  102741437  105235140
chr1   100090254  101094139  100090254  101614730
chr1   100426977  101614730  100090254  101614730
chr2   86644663   87767193   86644663   87767193
chr2   82473711   83636545   82473711   83636545
chr2   83896702   85079032   83876122   85091910
chr2   83876122   85091910   83876122   85091910
chr2   82943211   84350917   82943211   84350917
chr3   89410051   90485635   89405753   90485635
chr3   89405753   90485635   89405753   90485635
chr3   86491492   87593215   86491492   87593215
chr3   82507157   83738004   82507157   83738004
chr3   85059618   86362254   85059618   86362254
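
The 50% rule for a single pair of rows can be checked by hand. The sketch below is only an illustration, not part of the original question: `pair_overlap` is a hypothetical helper name, and it assumes that "at least 50% overlap" means the shared region covers at least half of each interval's own length. It prints the two percentages and, when both pass, the merged range.

```shell
# pair_overlap START1 END1 START2 END2   (hypothetical helper, for illustration)
# Prints the percentage of each interval covered by their intersection,
# followed by the merged range when both percentages are at least 50.
pair_overlap() {
    awk -v s1="$1" -v e1="$2" -v s2="$3" -v e2="$4" 'BEGIN {
        s1 += 0; e1 += 0; s2 += 0; e2 += 0   # force numeric comparison
        lo = (s1 > s2) ? s1 : s2             # intersection start
        hi = (e1 < e2) ? e1 : e2             # intersection end
        len = (hi > lo) ? hi - lo : 0        # 0 when the intervals are disjoint
        p1 = 100 * len / (e1 - s1)           # share of interval 1 that is covered
        p2 = 100 * len / (e2 - s2)           # share of interval 2 that is covered
        printf "%.1f %.1f", p1, p2
        if (p1 >= 50 && p2 >= 50)
            printf " %d %d", (s1 < s2) ? s1 : s2, (e1 > e2) ? e1 : e2
        printf "\n"
    }'
}

# First two rows of the sample data
pair_overlap 117132092 118875009 117027758 119458215
```

For the first two rows this prints `100.0 71.7 117027758 119458215`, which agrees with the NewStart and NewEnd values listed for them. Zero-length intervals (start equal to end) would divide by zero and are not handled in this sketch.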

I have achieved this in R, but the original file is huge and the script takes a very long time to run. Could someone help me do this in awk?

Recommended answer

Using GNU Awk version 4 or later (needed for arrays of arrays), you could try:

gawk -f a.awk file file

where a.awk is:

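# Pass 1 (first copy of the file): store every interval, keyed by group.
NR==FNR {
    if (FNR>1) {            # skip the header line
        a[$1][++i]=$2       # start coordinates
        b[$1][i]=$3         # end coordinates
    }
    next
}
# Pass 2 (second copy of the file): print the header once.
FNR==1 {
    fmt="%-7s%-10s%-10s%-10s%-10s\n"
    printf fmt,"Group","Start","End","NewStart","NewEnd"
}
# Pass 2: compute NewStart and NewEnd for each data row.
FNR>1 {
    $4=$2; $5=$3            # default: report the original coordinates
    n=checkInside($1,$2,$3)
    if (n>0) {
        ff=0; x=$2; y=$3
        for (i=1; i<=n; i++) {
            ar=a[$1][R[i]]; br=b[$1][R[i]]
            getIntersect($2,$3,ar,br)
            getLargest($2,$3,ar,br)
            ovl=((i2-i1)/($3-$2))*100   # % of the current interval covered
            ovr=((i2-i1)/(br-ar))*100   # % of the other interval covered
            # "at least 50%" in both directions, hence >= rather than >
            if (ovl>=50 && ovr>=50) {
                if (r1<x) x=r1          # widen to the smallest start
                if (r2>y) y=r2          # widen to the largest end
                ff=1
            }
        }
        if (ff) {
            $4=x; $5=y
        }
    }
    printf fmt,$1,$2,$3,$4,$5
}

# Sets globals r1,r2 to the range spanned by both intervals.
function getLargest(x1,y1,x2,y2) {
    r1=(x1<=x2)?x1:x2
    r2=(y1>=y2)?y1:y2
}

# Sets globals i1,i2 to the intersection of two overlapping intervals.
function getIntersect(x1,y1,x2,y2) {
    if (x1>=x2 && x1<=y2) {
        i1=x1
    } else {
        i1=x2
    }
    i2=(y1<=y2)?y1:y2
}

# Fills global R[1..n] with the indices of the other intervals in group g
# that contain the start or the end of [x,y], and returns n.
# i, j, x1 and y1 are local variables.
function checkInside(g,x,y,i,j,x1,y1) {
    R["x"]=0                # make sure R exists as an array
    for (i in a[g]) {
        x1=a[g][i]; y1=b[g][i]
        if ((x>=x1 && x<=y1) || (y>=x1 && y<=y1)) {
            if (!(x==x1 && y==y1))      # skip the interval itself
                R[++j]=i
        }
    }
    return j
}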

Output:

Group  Start     End       NewStart  NewEnd    
chr1   117132092 118875009 117027758 119458215 
chr1   117027758 119458215 117027758 119458215 
chr1   103756473 104864582 103354114 104864582 
chr1   105093795 106219211 105093795 106219211 
chr1   103354114 104747251 102741437 105235140 
chr1   102741437 105235140 102741437 105235140 
chr1   100090254 101094139 100090254 101614730 
chr1   100426977 101614730 100090254 101614730 
chr2   86644663  87767193  86644663  87767193  
chr2   82473711  83636545  82473711  83636545  
chr2   83896702  85079032  83876122  85091910  
chr2   83876122  85091910  83876122  85091910  
chr2   82943211  84350917  82943211  84350917  
chr3   89410051  90485635  89405753  90485635  
chr3   89405753  90485635  89405753  90485635  
chr3   86491492  87593215  86491492  87593215  
chr3   82507157  83738004  82507157  83738004  
chr3   85059618  86362254  85059618  86362254  
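
The file name appears twice on the command line because a.awk reads the input in two passes: while `NR==FNR` holds (the first copy) it only stores data, and on the second copy it produces output. A minimal sketch of that idiom is shown below; the demo file and the `two_pass` function are made up for illustration and use only POSIX awk features.

```shell
# Hypothetical demo file: one header row and three data rows.
demo=$(mktemp)
printf 'Group Start End\nchr1 10 20\nchr1 15 30\nchr2 5 9\n' > "$demo"

# Pass 1 (NR == FNR) counts the rows in each group.
# Pass 2 prints every data row followed by the size of its group.
two_pass() {
    awk 'NR == FNR { if (FNR > 1) count[$1]++; next }
         FNR > 1   { print $1, $2, $3, count[$1] }' "$1" "$1"
}

two_pass "$demo"
```

This prints `chr1 10 20 2`, `chr1 15 30 2` and `chr2 5 9 1`, one per line. Because the first pass keeps every interval in memory, the approach trades memory for the ability to look at the whole group before printing any row.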
