How to use info on substring position from one file to extract substring from another file (loop, bash)


Problem description


I'm trying to write a script that loops over one file and extracts substrings from it, taking the information on where to cut from another file. I'm working in bash on MobaXterm. I have the file cut_positions.txt, which is tab-delimited and lists name, start point, end point, length and comment:

k141_20066  103484  104617  1133    phnW  
k141_20841  13200   14324   1124    phnW  
k141_23852  69  452 383 phnW  
k141_32328  1   180 179 phnW 

and the file string_file.txt with the name (removing or adding the ">" in one of the files would be no problem) and the string (the original strings are much longer, up to 1,000,000 characters):

>k141_10671 CCTTCCCCCACACGCCGCTCTTCCGCTCTTGCTGGCC  
>k141_10707 AGGCGGTATCAGACCTTGCCGCAACACTAAGCCCAGTAACGCTGTCGCCCTTATATCTGA  
>k141_11190 CTTTTGTGACAGTGCAGGGCAATGGTGGATTTATCAGTATCGGGCAGAA  
>k141_1479  AGCCGACAGCAGCGCCGAGGGCACATAATCCGATGACACGATGTCCAAAAGATCCGCCTCGGC

Now I want to use the input from the cut_positions.txt. I want to use the first column to match the right line, then the second column as start point of the substring and the fourth column as length of the substring. This should be done with all lines in cut_positions.txt and written to a new out.txt. To get closer I tried (with my original data):

➤ grep ">k141_28027\b" test_out_one_line.txt | awk '{print substr($2,57251,69)}'
TCACTTGAGCGCAATTATTCGCTCTCCGGCGGCGTCAGCATCAGCCTGATCATGCGTCACCAAAAGTGT

which worked well as a manual approach. I also figured out how to access the individual elements of cut_positions.txt (here, the second column of the first row):

awk -F '\t' 'NR==1{print $2}' cut_positions.txt

but I can't figure out how to turn this into a loop, as I don't know how to connect the different redirections, piping steps and so on that I used for the small steps. Any help is very much appreciated (and tell me, if you need more sample data)

thanks crazysantaclaus

Solution

The following script should work for you:

cut.awk

# We are reading two files: pos.txt and strings.txt
# NR is equal to FNR as long as we are reading the
# first file.
NR==FNR{
    pos[">"$1]=$2 # Store the start point in array pos, keyed by ">" plus the name in $1
    len[">"$1]=$4 # Store the length in array len, using the same key
    next # skip the block below for pos.txt
}

# This runs on every line of strings.txt
$1 in pos {
    # Extract a substring of $2 based on the position and length
    # stored above
    key=$1
    mod=substr($2,pos[key],len[key])
    $2=mod
    print # Print the modified line
}

Call it like this:

awk -f cut.awk pos.txt strings.txt
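
If you would rather not keep a separate cut.awk file, the same logic can probably be wrapped in a small shell function. This is only a sketch: the function name extract_substrings is made up for illustration, and it assumes whitespace-separated columns as in the samples.

```shell
# extract_substrings POSFILE STRINGFILE
# Hypothetical helper: the same two-file awk logic as cut.awk, inlined.
extract_substrings() {
    awk '
        # First file: remember start and length, keyed by ">" plus the name
        NR == FNR { pos[">" $1] = $2; len[">" $1] = $4; next }
        # Second file: cut the string of every line whose name is known
        $1 in pos { $2 = substr($2, pos[$1], len[$1]); print }
    ' "$1" "$2"
}
```

With the file names from the question, calling extract_substrings cut_positions.txt string_file.txt > out.txt should write the result to out.txt.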


One important thing to mention: substr() treats strings as starting at index 1, unlike most programming languages, where strings start at index 0. If the positions in pos.txt are 0-based, the substr() call must become:

mod=substr($2,pos[key]+1,len[key])
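
A quick way to see the difference is to run both variants on a short test string. The variable names below are only for illustration. Note also that in the question's sample rows the length column equals end minus start (for example 452 - 69 = 383), so it is worth checking against a known sequence whether the last position is meant to be included.

```shell
s=123456

# Start value 2 read as 1-based: begins at the second character
one_based=$(awk -v s="$s" 'BEGIN { print substr(s, 2, 3) }')

# Same start value read as 0-based: add 1 before calling substr()
zero_based=$(awk -v s="$s" 'BEGIN { print substr(s, 2 + 1, 3) }')

echo "$one_based"   # 234
echo "$zero_based"  # 345
```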


I recommend testing it with simplified, meaningful versions of the input files:

pos.txt

foo  2  5  3    phnW  
bar  4  5  1    phnW
test 1  5  4    phnW

and strings.txt

>foo 123456  
>bar 123456
>non 123456

Output:

>foo 234
>bar 4
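
Since the question explicitly asks about a loop: for comparison, here is a sketch of how the manual grep and awk steps from the question could be chained in a shell loop. The function name loop_extract is hypothetical, and it assumes the names contain no regex metacharacters. It starts one grep and one awk per row of the positions file, so on large inputs it will likely be much slower than the single awk pass.

```shell
# loop_extract POSFILE STRINGFILE
# Hypothetical helper: one grep + awk per row of the positions file.
loop_extract() {
    # read splits each row on spaces/tabs into the five columns;
    # the "|| [ -n ... ]" part also handles a last line without a newline
    while read -r name start end len comment || [ -n "$name" ]; do
        # Pick the line for this name, then cut its second field
        grep "^>${name}[[:space:]]" "$2" |
            awk -v s="$start" -v l="$len" '{ print $1, substr($2, s, l) }'
    done < "$1"
}
```

Unlike the awk script, this prints the results in the order of the positions file rather than the order of the strings file.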
