Use AWK to search through a FASTA file, given a second file containing sequence names


Problem description

I have two files. One is a FASTA file containing multiple sequences; the other lists the names of the candidate sequences I want to search for (example files below).

seq.fasta

>Clone_18
GTTACGGGGGACACATTTTCCCTTCCAATGCTGCTTTCAGTGATAAATTGAGCATGATGGATGCTGATAATATCATTCCCGTGT
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
>Clone_27-2
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTCGTTTTGTTCTAGATTAACTATCAGTTTGGTTCTGTTTGTCCTCGTACTGGGTTGTGTCAATGCACAACTT
>Clone_34-1
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCG
>Clone_34-3
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCGATATCGCTGAAGCCCAATC
>Clone_44-1
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCC
>Clone_44-3
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCCCGGCAGCGCAGCCGTCGTCTCTACCCTTCACCAGGAATAAGTTTATTTTTCTACTTAC

name.txt

Clone_23
Clone_27-1

I want to use AWK to search through the FASTA file and extract every sequence whose name is listed in the second file.

awk 'NR==FNR{a[$1]=$1} BEGIN{RS="\n>"; FS="\n"} NR>FNR {if (match($1,">")) {sub(">","",$1)} for (p in a) {if ($1==p) print ">"$0}}' name.txt seq.fasta

The problem is that this extracts only the sequence of the first candidate in name.txt:

>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA

Can anyone help fix the one-line awk command above?

Recommended answer

If it is acceptable, or even desired, to print the names as well, you can simply use grep:

grep -Ff name.txt -A1 seq.fasta

  • -f name.txt picks patterns from name.txt
  • -F treats them as literal strings rather than regular expressions
  • -A1 prints the matching line plus the subsequent line

If the names are not wanted in the output, I would simply pipe the result to another grep:

      above_command | grep -v '>'
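One caveat with the grep approach: -F matches the names as substrings, so a name such as Clone_27 would also select Clone_27-1, and GNU grep prints a `--` line between non-adjacent groups of context lines. The sketch below is one possible way around both, assuming each header line consists of nothing but `>` followed by the name. The sample data and the helper file headers.txt are made up for illustration.

```shell
# Hypothetical sample data in a scratch directory (not the files from the question).
workdir=$(mktemp -d)
cat > "$workdir/seq.fasta" <<'EOF'
>Clone_27
AAAA
>Clone_23
CCCC
>Clone_27-1
GGGG
EOF
printf 'Clone_27\n' > "$workdir/name.txt"

# Substring matching: "Clone_27" also selects ">Clone_27-1".
# The second grep drops the "--" separator between context groups.
loose=$(grep -F -A1 -f "$workdir/name.txt" "$workdir/seq.fasta" | grep -v '^--$')

# Whole-line matching: prefix every name with ">" and add -x,
# so a pattern must match the entire header line.
sed 's/^/>/' "$workdir/name.txt" > "$workdir/headers.txt"
exact=$(grep -Fx -A1 -f "$workdir/headers.txt" "$workdir/seq.fasta" | grep -v '^--$')

printf 'loose:\n%s\n\nexact:\n%s\n' "$loose" "$exact"
rm -r "$workdir"
```

With -x, headers that carry a description after the name no longer match, so this variant only fits files with bare `>name` header lines.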
      


An awk solution can look like this:

      awk 'NR==FNR{n[$0];next} substr($0,2) in n && getline' name.txt seq.fasta
      

It is easier to explain in a multiline version:

      # True as long as we are reading the first file, name.txt
      NR==FNR {
          # Store the names in the array 'n'
          n[$0]
          next
      }
      
      # substr($0,2) strips the leading `>`; the rest of the line is the name,
      # which is looked up as a key of `n`. On a match, getline loads the next
      # line into $0; if that succeeds the condition is true and awk prints it.
      substr($0,2) in n && getline
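As for the one-liner in the question, it most likely stops after the first name because RS and FS are set in BEGIN and therefore also apply to name.txt: the whole file is read as a single record, and $1 holds only its first line. The following is a minimal sketch of a variant that leaves RS and FS at their defaults, keeps the header line in the output, and also copes with sequences wrapped over several lines. The shortened sequences and the names `wanted` and `keep` are illustrative assumptions, not part of the original answer.

```shell
# Hypothetical, shortened sample data in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/seq.fasta" <<'EOF'
>Clone_18
GTTACGGGGGACAC
>Clone_23 optional description
GTTACGGGGGGCCG
AAAAACACCCAATC
>Clone_27-1
GTTACGGGGACCAC
>Clone_27-2
GTTACGGGGACCAT
EOF
printf 'Clone_23\nClone_27-1\n' > "$workdir/name.txt"

selected=$(awk '
    # First file: remember every non-empty name as an array key.
    NR == FNR { if (NF) wanted[$1]; next }
    # Header line: decide whether the record that starts here is kept.
    /^>/ { keep = (substr($1, 2) in wanted) }
    # Print the header and all following sequence lines while keep is true.
    keep
' "$workdir/name.txt" "$workdir/seq.fasta")

printf '%s\n' "$selected"
rm -r "$workdir"
```

Because the lookup uses $1, a description after the name in the header does not prevent a match. Note that the NR == FNR idiom assumes name.txt is not empty; with an empty first file, the lines of the FASTA file would be treated as names.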
      
