Awk matching between two files when regions intersect (any solutions welcome)


Question


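This builds on an earlier question, "Awk conditional filter one file based on another (or other solutions)": http://stackoverflow.com/questions/12727108/awk-conditional-filter-one-file-based-on-another-or-other-solutions

There is a quick summary at the bottom of the question.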

I have an awk program that outputs a column from rows in a text file, refGene.txt, if values in that row match 2 out of 3 values in another text file.

I need to include an additional criterion for finding a match between the two files. A row should be included if the range given by the 2 numerical values in a row of file 1 overlaps with the range given by the two values in a row of refGene.txt. Example lines from file 1:

chr1 10 20
chr2 10 20

and an example line from file 2 (refGene.txt), showing only the matching columns ($3, $5, $6):

chr1 5 30

Currently the awk program does not treat this as a match because, although the first column matches, neither the 2nd nor the 3rd column does. I would like a way to treat this as a match, because the region 10-20 in file 1 is WITHIN the range 5-30 in refGene.txt. However, the second line in file 1 should NOT match, because the first column does not match, and that is required. If there is a way to include cases where any part of the range in file 1 overlaps with any part of the range in refGene.txt, that would be really helpful (so a partial overlap also counts as a match). This should also replace the conditional statements below, since it would find all the cases they currently cover.

So, a summary: I want awk to print a match if $1 in file 1 matches $3 in file 2 AND the range $2-$3 in file 1 intersects at all with the range $5-$6 in file 2.

Please let me know if my question is unclear. Any help is really appreciated, thanks in advance! (Solutions do not have to be in awk.)

Rubal

FILES=/files/*txt   
for f in $FILES ;
do

    awk '
        BEGIN {
            FS = "\t";
        }
        FILENAME == ARGV[1] {
            pair[ $1, $2, $3 ] = 1;
            next;
        }
        {
            if ( pair[ $3, $5, $6 ] == 1 ) {
                print $13;
            }
        }
    ' $(basename $f) /files/refGene.txt > /files/results/$(basename $f) ;
done

Solution

You just need to use 2 arrays:

awk -F '\t' '
  NR == FNR {min[$1] = $2; max[$1] = $3; next}
  ($3 in min) && (min[$3] >= $5) && (max[$3] <= $6) {print $13}
'

NR==FNR is just another way to write FILENAME == ARGV[1] -- it looks at line numbers instead of filenames.
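
Note that the condition above matches only when the file 1 range lies entirely inside the refGene.txt range, and that `min[$1]`/`max[$1]` keep only the last range seen for each chromosome. For the partial-overlap case the question asks about, a minimal sketch could look like the following. It assumes inclusive coordinates and tab-separated input, and the sample rows, gene names, and temporary file names are hypothetical, invented purely for illustration. Two inclusive ranges intersect when each one starts no later than the other ends.

```shell
#!/bin/sh
# Sketch only: sample data below is hypothetical, not taken from a real refGene.txt.
tmp=$(mktemp -d)

# File 1: chromosome, start, end (tab-separated).
printf 'chr1\t10\t20\nchr2\t10\t20\n' > "$tmp/file1.txt"

# Helper that writes a 13-column row; only $3, $5, $6 and $13 matter here.
row() { printf 'x\tx\t%s\tx\t%s\t%s\tx\tx\tx\tx\tx\tx\t%s\n' "$1" "$2" "$3" "$4"; }
{
  row chr1 5 30 GENE_A      # fully contains 10-20
  row chr1 15 40 GENE_B     # partially overlaps 10-20
  row chr1 25 50 GENE_C     # same chromosome, no overlap
  row chr3 5 30 GENE_D      # chromosome not present in file 1
  row chr2 100 200 GENE_E   # same chromosome, no overlap
} > "$tmp/refGene.txt"

result=$(awk -F '\t' '
  NR == FNR {                 # first file: store every range per chromosome
    n[$1]++
    lo[$1, n[$1]] = $2 + 0    # "+ 0" forces a numeric value
    hi[$1, n[$1]] = $3 + 0
    next
  }
  {                           # second file: test each stored range for overlap
    for (i = 1; i <= n[$3]; i++)
      if (lo[$3, i] <= $6 + 0 && $5 + 0 <= hi[$3, i]) {
        print $13
        break                 # print each refGene row at most once
      }
  }
' "$tmp/file1.txt" "$tmp/refGene.txt")

rm -rf "$tmp"
printf '%s\n' "$result"
```

With this sample data the script prints GENE_A and GENE_B. Storing the ranges under a counter per chromosome, rather than in a single `min`/`max` entry, lets file 1 hold several regions on the same chromosome; the cost is a linear scan over that chromosome's regions for every refGene.txt row.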
