Comparing one column value to all columns in a Linux environment
Problem description
I have two files: one VCF that looks like
88 Chr1 25 C - 3 2 1 1
88 Chr1 88 A T 7 2 1 1
88 Chr1 92 A C 16 4 1 1
and another with genes that looks like
GENEID Start END
GENE_ID 11 155
GENE_ID 165 999
I want a script that checks whether a gene position (3rd column of the VCF file) falls within the range given by the second and third columns of the second file and, if so, prints it out.
What I did so far was to join the files and run
awk '{if ($3 > $12 && $3 < $13) print }' > out
This only compares the current row of the joined files (it only prints if the value is in the same row). How can I make it compare every row of column 3 against every row of columns 12 and 13?
Best, Serg
Recommended answer
I hope this helps (EDIT: I changed the code to use a more efficient algorithm)
gawk '
# Read input.genes first and record, for every position, the ranges that cover it
NR == FNR {
    # skip the header line
    if (NR > 1) {
        for (i = $2; i <= $3; i++) {
            limits[i] = limits[i] "," $2 "-" $3
        }
    }
    next
}
# Then read input.vcf: print the line if column 3 falls within any recorded range
{
    if ($3 in limits) {
        print $0, "between(" limits[$3] ")"
    }
}' input.genes input.vcf
You get:
88 Chr1 25 C - 3 2 1 1 between(,11-155)
88 Chr1 88 A T 7 2 1 1 between(,11-155)
88 Chr1 92 A C 16 4 1 1 between(,11-155)
This version of the algorithm in Python uses dictionaries and is optimized for very large files:
# Python 3
with open("input.genes") as genes:
    limits = [line.split() for line in genes if line.strip()]
limits.pop(0)  # remove the header
limits = [tuple(map(int, v[1:3])) for v in limits]

# Map every position covered by a gene to the list of (start, finish) ranges
dict_limits = {}
for start, finish in limits:
    for i in range(start, finish + 1):
        dict_limits.setdefault(i, []).append((start, finish))

# Write each VCF line whose 3rd column is a covered position
with open("input.vcf") as vcf, open("my_output.txt", "w") as output:
    for reg in vcf:
        v_reg = reg.split()
        pos = int(v_reg[2])
        if pos in dict_limits:
            output.write(reg.strip() + "\tbetween({})\n".format(dict_limits[pos]))
You get:
88 Chr1 25 C - 3 2 1 1 between([(11, 155)])
88 Chr1 88 A T 7 2 1 1 between([(11, 155)])
88 Chr1 92 A C 16 4 1 1 between([(11, 155)])
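Both versions above store one entry per position inside every gene, so memory grows with the total length of all ranges. If that becomes a problem, a possible alternative is to keep only the (start, end) pairs and test each VCF position against them directly. The following is only a sketch under some assumptions: the helper names `load_genes` and `annotate` are made up for illustration, the bounds are treated as inclusive (as in the answer above), and the sample data from the question is inlined instead of being read from files.

```python
def load_genes(lines):
    """Parse gene lines (GENEID Start END), skipping the header row."""
    genes = []
    for line in list(lines)[1:]:
        fields = line.split()
        if len(fields) >= 3:
            genes.append((fields[0], int(fields[1]), int(fields[2])))
    return genes


def annotate(vcf_lines, genes):
    """Yield each VCF line whose 3rd column lies in at least one gene range."""
    for line in vcf_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        pos = int(fields[2])
        # Compare this position against every gene range (inclusive bounds)
        hits = [(start, end) for _, start, end in genes if start <= pos <= end]
        if hits:
            yield "{}\tbetween({})".format(line.strip(), hits)


# Sample data copied from the question
gene_lines = ["GENEID Start END", "GENE_ID 11 155", "GENE_ID 165 999"]
vcf_lines = [
    "88 Chr1 25 C - 3 2 1 1",
    "88 Chr1 88 A T 7 2 1 1",
    "88 Chr1 92 A C 16 4 1 1",
]

results = list(annotate(vcf_lines, load_genes(gene_lines)))
for row in results:
    print(row)
```

This trades lookup speed for memory: each VCF line is checked against every gene, which is fine for a modest number of genes but slower than the dictionary lookup when there are many.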