如何从基因列表中选择多个 fasta 序列 [英] How to pick multiple fasta sequences from a genes list
问题描述
我有两个文件
基因列表文件是这样的
LOC_Os06g12230.1
Pavir.Ab03005
Pavir.J14065
ChrUn.fgenesh
Sevir.1G325700
LOC_Os02g51280.1
Bradi3g59320
Brast04G017400
fasta 序列文件看起来像这样
Fasta sequence file looks like this
>LOC_Os03g57190.1 pacid=33130570 polypeptide=LOC_Os03g57190.1 locus=LOC_Os03g57190 ID=LOC_Os03g57190.1.MSUv7.0 annot-version=v7.0
ATGGAGGCGGCGGTGGGGGACGGGGAAGGCGGTGGCGGCGGCGGCGGGCGGGGGAAGCGTGGGCGGGGAGGAGGAGGAGG
GGAGATGGTGGAGGCGGTGTGGGGGCAGACGGGGAGTACGGCGTCGCGGATCTACAGGGTGAGGGCGACGGGGGGGAAGG
ACAGGCACAGCAAGGTGTACACGGCGAAGGGAATCCGCGACCGCCGCGTCCGCCTCTCCGTCGCCACCGCCATCCAGTTC
TACGACCTCCAGGACCGCCTCGGCTTCGACCAGCCGAGCAAGGCCATCGAGTGG
>LOC_Os02g51280.1 pacid=33134358 polypeptide=LOC_Os02g51280.1 locus=LOC_Os02g51280 ID=LOC_Os02g51280.1.MSUv7.0 annot-version=v7.0
ATGACCATGGACGTCGCCGGAGACGCCGGAGGTGGCCGCCGCCCAAACTTCCCCTTGCAGCTTCTTGAGAAGAAGGAGGA
CGGGCGGTGCCGGAGGGGAGATGCAGCTGCGGAAGGCGGCGCCGAAGCGGAGCTCCACCAAGGACCGGCACACCAAGGTG
GAAGGGAGGGGGCGGCGCATCCGGATGCCGGCGCTGTGCGCGGCGAGGGTGTTCCAGCTGACGCGGGAGCTGG
>LOC_Os06g12230.1 pacid=33145596 polypeptide=LOC_Os06g12230.1 locus=LOC_Os06g12230 ID=LOC_Os06g12230.1.MSUv7.0 annot-version=v7.0
ATGGATGTCACCGGAGACGGCGGAGGAGGAGGGCAACGGCCCAATTTCCCCCTGCAGCTCCTCGGGAAGAAGGAGGAGCA
GACGTGCTCGACGTCGCAGACTGCCGGGGCGGGCGGCGGCGGCGTCGTGGGCGCGAATGGGTCGGCGGCGGCGGCGCCGC
CGAAGCGGACGTCGACGAAGGACCGGCACACGAAGGTGGACGGGCGGGGGCGGCGCATCCGGATGCCGGCGATCTGCGCC
GCGCGGGTGTTCCAGCTGACGCGGGAGCTCGGGCACAAGACCGACGGCGA
>LOC_Os05g43760.1 pacid=33158388 polypeptide=LOC_Os05g43760.1 locus=LOC_Os05g43760 ID=LOC_Os05g43760.1.MSUv7.0 annot-version=v7.0
ATGACAAGCAATAACAGCACGAATGAGGAGCTCGGCGGCGGCGGCAGGAAGGCGGCCGACAAGCCGAGCGGCGGCGGCGG
CGCCGCCGCCGCCGTGGCGAGCTCGCGGCACTGGTCGGCGTCGACGGAGTCGCGGATCGTGCGCGTGTCGAGGGTGTTCG
GCGGCAAGGACCGTCACAGCAAGGTGAGGACGGTGAAGGGGCTCCGCGACCGGCGGGTGCGGCTGTCGGTGCCGACGGCG
ATCCAGCTCTACGACCTGCAGGACCGGCTGGGGCTCAGCCAGCCGAGCAAGGTGGTCGACT
如果基因名称和标题行匹配,则必须将序列拉出到新文件中
if the gene name and header line matches then, sequence has to be pulled out into new file
新文件应该包含
>LOC_Os02g51280.1 pacid=33134358 polypeptide=LOC_Os02g51280.1 locus=LOC_Os02g51280 ID=LOC_Os02g51280.1.MSUv7.0 annot-version=v7.0
ATGACCATGGACGTCGCCGGAGACGCCGGAGGTGGCCGCCGCCCAAACTTCCCCTTGCAGCTTCTTGAGAAGAAGGAGGA
CGGGCGGTGCCGGAGGGGAGATGCAGCTGCGGAAGGCGGCGCCGAAGCGGAGCTCCACCAAGGACCGGCACACCAAGGTG
GAAGGGAGGGGGCGGCGCATCCGGATGCCGGCGCTGTGCGCGGCGAGGGTGTTCCAGCTGACGCGGGAGCTGG
>LOC_Os06g12230.1 pacid=33145596 polypeptide=LOC_Os06g12230.1 locus=LOC_Os06g12230 ID=LOC_Os06g12230.1.MSUv7.0 annot-version=v7.0
ATGGATGTCACCGGAGACGGCGGAGGAGGAGGGCAACGGCCCAATTTCCCCCTGCAGCTCCTCGGGAAGAAGGAGGAGCA
GACGTGCTCGACGTCGCAGACTGCCGGGGCGGGCGGCGGCGGCGTCGTGGGCGCGAATGGGTCGGCGGCGGCGGCGCCGC
CGAAGCGGACGTCGACGAAGGACCGGCACACGAAGGTGGACGGGCGGGGGCGGCGCATCCGGATGCCGGCGATCTGCGCC
GCGCGGGTGTTCCAGCTGACGCGGGAGCTCGGGCACAAGACCGACGGCGA
我试过这样
grep -f genelist.txt -A3 fastafile.txt >> newfasta.txt
但是不同的fasta序列有不同的长度,
but different fasta sequences have different lengths,
模式匹配后,我想选择直到下一个>"符号出现
After pattern match, i want to pick till next '>' symbol appears
推荐答案
使用 awk 处理 FASTA 文件的最简单方法是建立一个名为 name
的变量和一个名为 seq 的变量代码>.每次阅读完整序列时,都可以对其进行处理.请注意,为了获得最佳处理方式,序列应存储为连续字符串,并且不包含任何换行符或空格.用于处理 fasta 的通用 awk,如下所示:
The easiest way to process FASTA files with awk, is to build up a variable called name
and a variable called seq
. Every time you read a full sequence, you can process it. Remark that, for the best way of processing, the sequence, should be stored as a continues string, and not contain any newlines or whitespaces due. A generic awk for processing fasta, looks like this:
awk '/^>/ && seq { process_sequence_here }
/^>/{name=$0; seq=""; next}
{seq = seq $0 }
END { process_sequence_here }' file.fasta
您可以通过引入几个函数来简化此操作:
You can make this a bit easier by introducing a couple of functions:
awk '/^>/ && seq { process_sequence(name_seq) }
/^>/{name=substr($0,2); seq=""; next}
{seq = seq $0 }
END { process_sequence(name,seq) }
BEGIN{seq_ere=sprintf("%80s","");gsub(" ",".",seq_ere) }
function print_sequence(name,seq) {
gsub(seq_ere,"&" ORS, seq); print ">" name ORS seq
}
function process_sequence(name,seq) { ... }
' file.fasta
在 OP 的情况下,以上内容为:
In case of the OP, the above would read:
awk '(NR==FNR) { a[$0]; next }
/^>/ && seq { process_sequence(name_seq) }
/^>/{name=substr($0,2); seq=""; next}
{seq = seq $0 }
END { process_sequence(name,seq) }
BEGIN{seq_ere=sprintf("%80s","");gsub(" ",".",seq_ere) }
function print_sequence(name,seq) {
gsub(seq_ere,"&" ORS, seq); print ">" name ORS seq
}
function process_sequence(name,seq) {
$0=name; if ($1 in a) print_sequence (name,seq)
}
' list.txt file.fasta
当您使用 awk 处理 fasta 文件时,您始终可以考虑使用 bioawk.它具有来自 POSIX awk 的所有花里胡哨的功能,但经过增强以轻松处理 FASTA 文件:
When you process fasta-files with awk, you can always concider to use bioawk. It has all the bells-and-whistles from POSIX awk, but is augmented to easily process FASTA files:
注意: BioAwk 基于 Brian Kernighan 的 awk,记录在 "AWK 编程语言",作者:Al Aho、Brian Kernighan 和 Peter Weinberger(Addison-Wesley, 1988, ISBN 0-201-07981-X).我不确定这个版本是否与 POSIX 兼容.
Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX.
这篇关于如何从基因列表中选择多个 fasta 序列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!