Using awk to print pairs of records having overlapping range of values between two columns


Question

I have records that each describe a range, with the start in $6 and the stop in $7. What I want to do is print every pair of records whose ranges overlap.

For example, my data looks like this:

id1 0   376 . scaffold1 5165761 5166916 
id2 0   366 . scaffold1 2297244 2298403 
id3 155 456 . scaffold1 692777  693770 
id4 185 403 . scaffold1 102245  729675

What I want is a result like:

id3 id4

because the range of id4 overlaps that of id3. I have searched all over the internet for a solution, but found nothing that comes close to my problem.

I would really appreciate any advice.

After following some of the advice in the replies below, I tried this code, which did work:

awk '{start[$1]=$6;stop[$1]=$7;} END {for(i in start) {for(j in stop) {if(start[i] >= start[j] && start[i] <= stop[j]) print i,j}}}' file | awk '{if($1!=$2) print}' -

The processing time was quite short: it finished in under 1 minute for a file with 1400 records.

Recommended answer

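$ cat tst.awk
{
    # Remember each record's range keyed by id, plus the ids in input order.
    beg[$1] = $6
    end[$1] = $7
    ids[++numIds] = $1
}
END {
    # Compare every ordered pair of distinct ids.
    for (i=1; i<=numIds; i++) {
        idI = ids[i]
        for (j=1; j<=numIds; j++) {
            idJ = ids[j]
            if (idI != idJ) {
                # Overlap: either end point of idI lies inside the range of idJ.
                if ( ( (beg[idI] >= beg[idJ]) && (beg[idI] <= end[idJ]) ) ||
                     ( (end[idI] >= beg[idJ]) && (end[idI] <= end[idJ]) ) ) {
                    # Order-independent key, so each pair is printed only once.
                    if ( !seen[(idI<idJ ? idI FS idJ : idJ FS idI)]++ ) {
                        print idI, idJ
                    }
                }
            }
        }
    }
}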

$ awk -f tst.awk file
id3 id4

The input file you provided in your question doesn't cover many cases, so here is an input file with a lot more overlap variants in it:

$ cat file
id1 185 403 . scaffold1 10  20
id2 185 403 . scaffold1 11  19
id3 185 403 . scaffold1  9  10
id4 185 403 . scaffold1 20  21
id5 185 403 . scaffold1  9  11
id6 185 403 . scaffold1 19  21
id7 185 403 . scaffold1 10  20
id8 185 403 . scaffold1  1   8

Running the script above on it:

$ awk -f tst.awk file
id1 id3
id1 id4
id1 id5
id1 id6
id1 id7
id2 id1
id2 id5
id2 id6
id2 id7
id3 id5
id3 id7
id4 id6
id4 id7
id5 id7
id6 id7

versus the script + pipe you provided at the end of your question:

$ awk '{start[$1]=$6;stop[$1]=$7;} END {for(i in start) {for(j in stop) {if(start[i] >= start[j] && start[i] <= stop[j]) print i,j}}}' file | awk '{if($1!=$2) print}' -
id3 id5
id4 id6
id4 id7
id4 id1
id5 id3
id6 id7
id6 id1
id6 id2
id7 id3
id7 id5
id7 id1
id1 id3
id1 id5
id1 id7
id2 id5
id2 id7
id2 id1

Notice that your script reports the overlap between some (but not all) of the ids twice:

id1 id7
id7 id1
id3 id5
id5 id3

whereas my script reports each pair only once, courtesy of !seen[(idI<idJ ? idI FS idJ : idJ FS idI)]++.
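
As a side note, the de-duplication is needed because every pair is visited in both orders. A possible variant, offered only as a sketch and not as part of the script above, visits each unordered pair once by starting the inner loop at i+1 and uses a single symmetric test: two closed ranges overlap when each one starts no later than the other ends. The function name overlap_pairs is hypothetical, and the sketch assumes $1 is a unique id and $6/$7 are numeric with start <= stop.

```shell
# Hypothetical helper: print each overlapping pair of ids exactly once.
# Reads the files given as arguments, or stdin when there are none.
overlap_pairs() {
    awk '
    {
        # Assumption: $1 is a unique id, $6 and $7 are numeric start and stop.
        ids[++n] = $1
        beg[n] = $6 + 0    # "+ 0" forces a numeric comparison later on
        end[n] = $7 + 0
    }
    END {
        # j starts at i + 1, so every unordered pair is tested exactly once.
        for (i = 1; i < n; i++)
            for (j = i + 1; j <= n; j++)
                # Closed ranges overlap when each starts no later than the other ends.
                if (beg[i] <= end[j] && beg[j] <= end[i])
                    print ids[i], ids[j]
    }' "$@"
}
```

Called as overlap_pairs file on the four-line sample from the question, this prints id3 id4. On the larger test file it reports the same 15 pairs, but always with the earlier record first (so id1 id2 rather than id2 id1).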

