AWK: extract lines if column in file 1 falls within a range declared in two columns in other file


Problem description

Currently I'm struggling with an AWK problem that I haven't been able to solve yet. I have one huge file (30GB) of genomic data that holds a list of positions (declared in columns 1 and 2) and a second file that holds a number of ranges (declared in columns 3, 4 and 5). I want to extract all lines in the first file where the position falls within a range declared in the second file. As a position is only unique within a given chromosome (chr), it first has to be tested whether the chromosomes are identical (i.e. column 1 in file 1 matches column 3 in file 2).

File 1

chromosome position another....hundred.....columns
chr1       816 .....
chr1       991 .....
chr2       816 .....
chr2       880 .....
chr2       18768 .....
...
chr22      9736286 .....

File 2

name    identifier chromosome   start    end
GENE1   ucsc.86    chr1         800      900
GENE2   ucsc.45    chr2         700      1700
GENE3   ucsc.46    chr2         18000    19000

Expected output

chromosome position another....hundred.....columns
chr1       816 .....
chr2       816 .....
chr2       880 .....
chr2       18768 .....

A summary of what I intend to do (half-coded):

if ($1 in file1 matches $3 in file2) {                              ## test if on the correct chr
   if ($2 in file1 >= $4 in file2 && $2 in file1 <= $5 in file2) {  ## test if pos is in the range
         print $0 (from file1)                                      ## if so, print the row from file1
   }
}

I kind of understand how to solve this problem by putting file1 in an array and using the position as the index, but then I still have a problem with the chr, and besides that, file1 is way too big to put in an array (although I have 128GB of RAM). I've tried some things with multi-dimensional arrays but couldn't really figure out how to do that either.

Thank you very much for your help.

Update 8/5/14: Added a third line to file 2 containing another range on the same chromosome as the second line. This line is skipped by the script below.

Recommended answer

The change in your data set actually changed the question considerably. You introduced an element that was used as a key, and since keys have to be unique, it got overwritten.
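To make the overwrite concrete, here is a minimal sketch. It assumes the ranges were stored under the chromosome name alone; the earlier script is not shown here, so that shape is a guess, and the function name `show_overwrite` and the arrays `lo` and `hi` are purely illustrative.

```shell
# Hypothetical illustration: two ranges on chr2 stored under a chromosome-only key.
show_overwrite() {
    printf '%s\n' \
        'GENE2 ucsc.45 chr2 700 1700' \
        'GENE3 ucsc.46 chr2 18000 19000' |
    awk '
        # The key is the chromosome only, so the second chr2 line replaces the first.
        { lo[$3] = $4; hi[$3] = $5 }
        END { for (c in lo) print c, lo[c], hi[c] }
    '
}

show_overwrite
# prints: chr2 18000 19000
```

Only one range per chromosome survives with this kind of key, so every position that falls in the other range is silently missed.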

For your data set, you are better off making composite keys. Something like:

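awk '
# First pass (file2): remember each range under the composite key chr, start, end.
# The header line of file2 is skipped.
NR==FNR { if (FNR > 1) range[$3,$4,$5]; next }
# Second pass (file1): print the header line as-is.
FNR==1 { print; next }
{
    for (x in range) {
        split(x, check, SUBSEP)
        # Same chromosome and position within [start, end]: print once and stop.
        if ($1==check[1] && $2>=check[2] && $2<=check[3]) { print $0; break }
    }
}
' file2 file1

Output

chromosome position another....hundred.....columns
chr1       816 .....
chr2       816 .....
chr2       880 .....
chr2       18768 .....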
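
The loop above walks over every stored range for every line of file1, which may become slow on a 30GB input if file2 holds many ranges. One possible variation, not part of the original answer, groups the ranges per chromosome so that each line is only compared against ranges on its own chromosome. This is a sketch under the assumption that both files have a header line and whitespace-separated columns as in the samples; the function name `filter_by_range` and the arrays `n`, `lo` and `hi` are illustrative.

```shell
# Hypothetical helper: $1 = ranges file (file2 layout), $2 = positions file (file1 layout).
filter_by_range() {
    awk '
    NR==FNR {
        # file2: skip the header, then append this range to the list for its chromosome.
        if (FNR > 1) { i = ++n[$3]; lo[$3, i] = $4; hi[$3, i] = $5 }
        next
    }
    # file1: pass the header line through unchanged.
    FNR==1 { print; next }
    {
        # Check only the ranges recorded for this chromosome; print once on the first hit.
        for (i = 1; i <= n[$1]; i++)
            if ($2 >= lo[$1, i] && $2 <= hi[$1, i]) { print; break }
    }
    ' "$1" "$2"
}
```

It would be called as `filter_by_range file2 file1`. File1 is still streamed line by line, so only the ranges from file2 are held in memory.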
