Awk matching between two files when regions intersect (any solutions welcome)
This builds on an earlier question, Awk conditional filter one file based on another (or other solutions): http://stackoverflow.com/questions/12727108/awk-conditional-filter-one-file-based-on-another-or-other-solutions
A quick summary is at the bottom of the question.
I have an awk program that outputs a column from rows in a text file, refGene.txt, if values in that row match 2 out of 3 values in another text file.
I need to include an additional criterion for finding a match between the two files: a row should be included if the range given by the two numerical values in a row of file 1 overlaps the range given by the two values in a row of refGene.txt. Example lines from file 1:
chr1 10 20
chr2 10 20
and an example line from file 2 (refGene.txt), showing only the matching columns ($3, $5, $6):
chr1 5 30
Currently the awk program does not treat this as a match because, although the first column matches, neither the 2nd nor the 3rd column does. But I would like to treat this as a match because the region 10-20 in file 1 is WITHIN the range 5-30 in refGene.txt. However, the second line in file 1 should NOT match, because its first column does not match, and that match is required. If there is a way to also include cases where any part of the range in file 1 overlaps any part of the range in refGene.txt, that would be really helpful (so partial overlap also counts as a match). It should also replace the conditional statement below, since it would find all the cases that statement currently catches.
So, a summary. I want awk to print a match if:
$1 in file 1 matches $3 in file 2, AND
the range $2-$3 in file 1 intersects at all with the range $5-$6 in file 2.
Please let me know if my question is unclear. Any help is really appreciated, thanks in advance! (Solutions do not have to be in awk.)
Rubal
FILES=/files/*txt
for f in $FILES ;
do
    awk '
        BEGIN {
            FS = "\t";
        }
        FILENAME == ARGV[1] {
            pair[ $1, $2, $3 ] = 1;
            next;
        }
        {
            if ( pair[ $3, $5, $6 ] == 1 ) {
                print $13;
            }
        }
    ' $(basename $f) /files/refGene.txt > /files/results/$(basename $f) ;
done
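You just need to use 2 arrays:
awk -F '\t' '
# file 1: remember the range for each chromosome (a later line for the same chromosome overwrites an earlier one)
NR == FNR {min[$1] = $2; max[$1] = $3; next}
# file 2: true when the stored file 1 range starts at or after $5 and ends at or before $6
($3 in min) && (min[$3] >= $5) && (max[$3] <= $6) {print $13}
'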
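NR == FNR is just another way to write FILENAME == ARGV[1]: it looks at line numbers instead of filenames. NR counts lines across all input files while FNR restarts at 1 for each file, so the two are equal only while the first file is being read.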
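The condition above is true when the file 1 range lies inside $5-$6. Since the question also asks for partial overlap, and file 1 may list several ranges for the same chromosome, here is a minimal sketch of one way to cover both. It is not part of the original answer. The sample file names, their contents, and the array names lo, hi and n are made up for illustration, and the ranges are assumed to be closed intervals (both ends inclusive).

```shell
#!/bin/sh
# Hypothetical sample data; names and contents are invented for this sketch.
tmp=$(mktemp -d)

# File 1: chromosome, start, end (tab-separated).
printf 'chr1\t10\t20\nchr1\t100\t200\nchr2\t10\t20\n' > "$tmp/regions.txt"

# Stand-in for refGene.txt: 13 tab-separated columns; only $3, $5, $6 and $13 matter here.
printf '1\tNM_A\tchr1\t+\t5\t30\t.\t.\t.\t.\t.\t.\tgeneA\n'    >  "$tmp/refGene.txt"
printf '2\tNM_B\tchr1\t+\t150\t400\t.\t.\t.\t.\t.\t.\tgeneB\n' >> "$tmp/refGene.txt"
printf '3\tNM_C\tchr1\t+\t40\t90\t.\t.\t.\t.\t.\t.\tgeneC\n'   >> "$tmp/refGene.txt"
printf '4\tNM_D\tchr3\t+\t10\t20\t.\t.\t.\t.\t.\t.\tgeneD\n'   >> "$tmp/refGene.txt"

result=$(awk -F '\t' '
    # First file: store every range per chromosome, indexed 1..n[chromosome].
    NR == FNR {
        n[$1]++
        lo[$1, n[$1]] = $2 + 0      # + 0 forces a numeric value
        hi[$1, n[$1]] = $3 + 0
        next
    }
    # Second file: print $13 once if any stored range on chromosome $3 overlaps $5-$6.
    {
        for (i = 1; i <= n[$3]; i++) {
            # Two closed ranges overlap when each one starts no later than the other ends.
            if (lo[$3, i] <= $6 + 0 && hi[$3, i] >= $5 + 0) {
                print $13
                break
            }
        }
    }
' "$tmp/regions.txt" "$tmp/refGene.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With this sample data the script prints geneA (10-20 lies inside 5-30) and geneB (100-200 partially overlaps 150-400). It skips geneC, which no file 1 range touches, and geneD, whose chromosome is not in file 1. If the real coordinates are half-open rather than closed, the <= and >= comparisons would need to become < and >.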