Comparing one column value to all columns in a Linux environment

Question

I have two files. One is a VCF that looks like this:

88  Chr1    25  C   -   3   2   1   1
88  Chr1    88  A   T   7   2   1   1
88  Chr1    92  A   C   16  4   1   1

and the other lists genes and looks like this:

GENEID  Start END
GENE_ID 11 155
GENE_ID 165 999

I want a script that checks whether each position (3rd column of the VCF file) falls within the range given by the second and third columns of the second file, and prints the line if it does.

What I did so far was to join the files and run

awk '{if ($3 > $12 && $3 < $13) print}' > out

This only compares values within the same row of the joined file (it prints only if the matching range is in the same row). How can I compare every row of column 3 against every row of columns 12 and 13?

Best, Serg

Recommended answer

I hope this helps (EDIT: I changed the code to use a more efficient algorithm).

gawk '
  # First file (input.genes): record, for every position in a gene, the range(s) covering it
  NR == FNR {
    # skip the header line
    if (FNR > 1) {
      for (i = $2; i <= $3; i++) {
        # append with a comma only when the position already has a range
        limits[i] = ((i in limits) ? limits[i] "," : "") $2 "-" $3
      }
    }
    next
  }
  # Second file (input.vcf): print the record if column 3 lies inside any range
  $3 in limits {
    print $0, "between(" limits[$3] ")"
  }' input.genes input.vcf

You get:

88  Chr1    25  C   -   3   2   1   1 between(11-155)
88  Chr1    88  A   T   7   2   1   1 between(11-155)
88  Chr1    92  A   C   16  4   1   1 between(11-155)

The same algorithm in Python, using a dictionary so that each lookup stays fast on very large files:

# Read the gene ranges (columns 2 and 3), skipping the header line.
with open("input.genes") as genes:
    next(genes, None)  # skip the header
    limits = [tuple(map(int, line.split()[1:3])) for line in genes if line.strip()]

# Map every position covered by a gene to the list of (start, finish) ranges containing it.
dict_limits = {}
for start, finish in limits:
    for i in range(start, finish + 1):
        dict_limits.setdefault(i, []).append((start, finish))

# Write each VCF record whose position (column 3) falls inside at least one range.
with open("input.vcf") as vcf, open("my_output.txt", "w") as output:
    for reg in vcf:
        v_reg = reg.split()
        if not v_reg:
            continue  # ignore blank lines
        pos = int(v_reg[2])
        if pos in dict_limits:
            output.write("{}\tbetween({})\n".format(reg.strip(), dict_limits[pos]))

You get:


88  Chr1    25  C   -   3   2   1   1   between([(11, 155)])
88  Chr1    88  A   T   7   2   1   1   between([(11, 155)])
88  Chr1    92  A   C   16  4   1   1   between([(11, 155)])
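
Both versions above store one entry per position covered by a gene, which can take a lot of memory when the ranges are long. As a possible alternative (not part of the original answer), the sketch below keeps only the ranges themselves and tests each position against all of them, which is the "compare every row to every row" the question asks about. The function name `find_ranges` and the in-memory sample data are illustrative assumptions, and the check is inclusive at both ends, like the code above.

```python
def find_ranges(position, ranges):
    """Return every (start, end) pair with start <= position <= end."""
    return [(start, end) for start, end in ranges if start <= position <= end]


# Sample data taken from the question (gene header line already removed).
genes = [(11, 155), (165, 999)]
vcf_lines = [
    "88\tChr1\t25\tC\t-\t3\t2\t1\t1",
    "88\tChr1\t88\tA\tT\t7\t2\t1\t1",
    "88\tChr1\t92\tA\tC\t16\t4\t1\t1",
]

for line in vcf_lines:
    position = int(line.split()[2])  # 3rd column of the VCF record
    hits = find_ranges(position, genes)
    if hits:
        print("{}\tbetween({})".format(line, hits))
```

This trades speed for memory: each record is checked against every gene, so the cost grows with the number of records times the number of genes. For a small gene list that is usually fine; for a large one, the dictionary approach above or a sorted-interval search may be the better fit.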
