How to use awk to add specific values to a column based on numeric ranges

Problem description

I'm trying to add a column to my file coverage_file based on the numbers in my bed_file. In coverage_file the positions are in the second column. The bed_file contains position ranges, running from the second to the third column, together with a name in column 4. For each position in coverage_file that falls within a range of the bed_file, I would like to add the corresponding name, and also have it numbered, so that I can distinguish between multiple position ranges on the same object (contig). I hope my example data makes it clearer:

#example data

#coverage file looks like:

#k141_xxx.xx are contigs (long sequences of DNA) on which different genes can be found.
#the second column is the current position on the individual contig
#the third column is the coverage on this position (not important here)
#the fourth column is the sample where the data comes from: A1..7 and B8..10

k141_102288 298 5 A4
k141_102288 298 5 A5
k141_102288 298 5 B8
k141_102288 298 5 B9
k141_102288 299 5 A4
k141_102288 299 5 A5
k141_102288 299 5 B9
k141_102288 300 5 A5
k141_102288 301 5 A5
k141_102511.0 8226 5 A5
k141_102511.0 8227 5 A5
k141_102511.0 8228 5 A5
k141_102511.0 8229 5 A5
k141_102511.0 8230 5 A5
k141_102511.0 8231 5 A5
k141_102511.0 8232 5 A5
k141_102511.0 8233 5 A5
k141_102511.0 8234 5 A5
k141_102511.0 9129 5 A6
k141_102511.0 9207 5 A6
k141_102511.0 9275 5 A7
k141_102511.0 9276 5 A7
k141_102511.0 9277 5 A7
k141_102511.0 9278 5 A7
k141_102511.0 9279 5 A7
k141_102511.0 9280 5 A7
k141_102511.0 9281 5 A7
k141_102511.0 9282 5 A7


#bed file looks like this
# the bed file shows the start $2 and end $3 position of a gene $4 on the contigs $1
k141_102288 2   301 phnE
k141_102511.0   7890    8807    phnE
k141_102511.0   8814    10400   phnE


#proposed output (note the two different regions of phnE on k141_102511.0)
k141_102288 298 5 A4    phnE_001
k141_102288 298 5 A5    phnE_001
k141_102288 298 5 B8    phnE_001
k141_102288 298 5 B9    phnE_001
k141_102288 299 5 A4    phnE_001
k141_102288 299 5 A5    phnE_001
k141_102288 299 5 B9    phnE_001
k141_102288 300 5 A5    phnE_001
k141_102288 301 5 A5    phnE_001
k141_102511.0 8226 5 A5 phnE_002
k141_102511.0 8227 5 A5 phnE_002
k141_102511.0 8228 5 A5 phnE_002
k141_102511.0 8229 5 A5 phnE_002
k141_102511.0 8230 5 A5 phnE_002
k141_102511.0 8231 5 A5 phnE_002
k141_102511.0 8232 5 A5 phnE_002
k141_102511.0 8233 5 A5 phnE_002
k141_102511.0 8234 5 A5 phnE_002
k141_102511.0 9129 5 A6 phnE_003
k141_102511.0 9207 5 A6 phnE_003
k141_102511.0 9275 5 A7 phnE_003
k141_102511.0 9276 5 A7 phnE_003
k141_102511.0 9277 5 A7 phnE_003
k141_102511.0 9278 5 A7 phnE_003
k141_102511.0 9279 5 A7 phnE_003
k141_102511.0 9280 5 A7 phnE_003
k141_102511.0 9281 5 A7 phnE_003
k141_102511.0 9282 5 A7 phnE_003

I tried to build on a similar question I asked earlier, but still can't figure out how to make it work: How to use info on substring position from one file to extract substring from another file (loop, bash)

Any suggestions? I tried to go with suggestion no. 2 by @Nic3500, but I can't get it to run: I get an unexpected token in the last line. This is what I have come up with so far:

#!bin/bash

# We are reading two files: coverage_file.txt and intersect.bed
# NR is equal to FNR as long as we are reading the
# first file.
# Store the positions in an array current_position from the coverage file (indexed by $1)
# go to bed file
# store the start and end positions and the gene names in similar arrays
# if current_position is between start_pos and end_pos, print additionally gene name 

awk 'NR==FNR{current_position[$1]=$2} 
NR==FNR{next}
{start_pos[$1]=$2;end_pos[$1]=$3;gene_name[$1]=$4}
{if(current_position[$1] >= start_pos[$1]) && (current_position[$1] <= end_pos[$1]){ print $1,$2,$3,$4,gene_name[$1]}}' coverage_file.txt intersect.bed > test.txt

Recommended answer

awk to the rescue!

 $ awk 'NR==FNR{start[NR]=$2; end[NR]=$3; key[$1,$2]=$4 sprintf("_%03d",NR); next}
           {for(i in start)
              {s=start[i];
               if(s<=$2 && $2<=end[i] && ($1,s) in key) print $0,key[$1,s]}}' bed coverage 

Explanation: While reading the first file (the NR==FNR block), build arrays indexed by line number that hold the start and the end of each range. The ranges also need to be tied to their contig, so a map indexed by contig name and range start is created. This is also the place to build the numbered tag: the gene name from the last field, followed by the line number formatted as a zero-padded three-digit number.
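To see what that first pass produces, the sketch below runs only the label-building step on the three bed lines from the question and prints the resulting map. The function name show_bed_keys and the temporary file are illustrative assumptions and not part of the answer; splitting the composite key on SUBSEP is simply one way to display it.

```shell
# Hypothetical helper: list contig, range start and generated label
# for every line of a bed file (sorted, because "for (k in key)" has no fixed order).
show_bed_keys() {
  awk '{ key[$1,$2] = $4 sprintf("_%03d", NR) }   # same label rule as in the answer
       END {
         for (k in key) {
           split(k, part, SUBSEP)                 # part[1] = contig, part[2] = start
           print part[1], part[2], key[k]
         }
       }' "$1" | sort
}

# Example run on the bed lines from the question.
bed_demo=$(mktemp)
printf '%s\n' \
  'k141_102288 2 301 phnE' \
  'k141_102511.0 7890 8807 phnE' \
  'k141_102511.0 8814 10400 phnE' > "$bed_demo"
show_bed_keys "$bed_demo"
rm -f "$bed_demo"
# prints:
# k141_102288 2 phnE_001
# k141_102511.0 7890 phnE_002
# k141_102511.0 8814 phnE_003
```

Because the counter is the bed line number, the labels are numbered across the whole file rather than per contig, which matches the proposed output in the question.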

For the second file (the second block), iterate over all range starts, check that the position lies between that start and its matching end, and verify that the (contig, range start) pair is a valid combination. If so, print the line with the formatted tag appended.

This can be made more efficient by indexing the start values by contig, at the cost of more complicated code. If your bed file is not huge, it shouldn't be a problem. The code also intentionally prints every matching entry rather than only the first, so that you can check that the ranges do not overlap. Alternatively, validate the ranges separately and gain speed by breaking out of the loop after the first match. If the start values are sorted, the loop can also be exited early once a range start lies beyond the current position.
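The indexing idea can be sketched as follows. This is a variant written under stated assumptions, not the answer's code: the function name annotate_coverage and the per-contig counter count are made up for illustration, and the break assumes that the ranges on one contig do not overlap. Only ranges on the same contig as the current position are examined, so a range on another contig that happens to share a start value cannot match by accident.

```shell
# Hypothetical helper: annotate_coverage BED COVERAGE
# Appends the numbered gene label to every coverage line whose position
# lies inside a bed range on the same contig.
annotate_coverage() {
  awk 'NR == FNR {                        # first file: the bed ranges
         n = ++count[$1]                  # ranges seen so far on this contig
         start[$1, n] = $2 + 0            # "+ 0" forces numeric comparison later
         end[$1, n]   = $3 + 0
         label[$1, n] = $4 sprintf("_%03d", NR)
         next
       }
       {                                  # second file: the coverage positions
         pos = $2 + 0
         for (i = 1; i <= count[$1]; i++)
           if (start[$1, i] <= pos && pos <= end[$1, i]) {
             print $0, label[$1, i]
             break                        # assumes non-overlapping ranges
           }
       }' "$1" "$2"
}

# Usage (file names as in the question):
#   annotate_coverage intersect.bed coverage_file.txt > test.txt
```

As in the answer, coverage lines that fall outside every range are not printed. To keep them, a plain print $0 could be added for the case where the loop finds no match.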
