Extract sequences from a multifasta file by ID in a file using awk

Question

I would like to extract the sequences from a multifasta file whose IDs match those given in a separate list of IDs.

FASTA file seq.fasta:

>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11605
TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
CCTGTTCGGGCGCCACTGCTAG
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
>7P58X:01334:11635
TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT
CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC
GAGCG
>7P58X:01336:11621
ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
CTAGCAGTAGAGGAGATCTCCTCGACGCAGGACT

ID file id.txt:

7P58X:01332:11636
7P58X:01334:11613

I want to get a FASTA file containing only the sequences whose IDs appear in id.txt:

>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC

I really like the awk approach I found in the answers here and here, but the code given there does not quite work for the example above. Here is why:

(1)

awk -v seq="7P58X:01332:11636" -v RS='>' '$1 == seq {print RS $0}' seq.fasta

This code works well for multiline sequences, but each ID has to be passed to the command individually.

(2)

awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' id.txt seq.fasta

This code can take the IDs from the id.txt file, but it returns only the first line of each multiline sequence.

I guess the right approach would be to modify the RS variable in code (2), but all of my attempts have failed so far. Can anybody please help me with that?

Recommended answer

$ awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' id.txt seq.fasta
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
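
The one-liner above can be read as three rules, sketched below in expanded form with comments. This is only an illustration of the same command, not a different method: the scratch directory, the output name selected.fasta and the flag name keep are assumptions made for the example, and the input files are recreated from the question so the script can be run as is.

```shell
#!/bin/sh
# Recreate the sample inputs from the question in a scratch directory.
# (The directory and the output file name are illustrative only.)
workdir=$(mktemp -d)

cat > "$workdir/seq.fasta" <<'EOF'
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11605
TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
CCTGTTCGGGCGCCACTGCTAG
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
>7P58X:01334:11635
TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT
CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC
GAGCG
>7P58X:01336:11621
ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
CTAGCAGTAGAGGAGATCTCCTCGACGCAGGACT
EOF

cat > "$workdir/id.txt" <<'EOF'
7P58X:01332:11636
7P58X:01334:11613
EOF

awk -F'>' '
    # Rule 1: while reading the first file (id.txt), NR equals FNR.
    # Store each whole line as an array key, then skip the other rules.
    NR == FNR { ids[$0]; next }

    # Rule 2: with ">" as the field separator, a header line splits into
    # an empty $1 and the ID in $2, so NF > 1 only on header lines.
    # Set the flag according to whether this ID was listed.
    NF > 1 { keep = ($2 in ids) }

    # Rule 3: a bare pattern prints the line when it is true, so the header
    # and every following sequence line are printed until the next header.
    keep
' "$workdir/id.txt" "$workdir/seq.fasta" > "$workdir/selected.fasta"

cat "$workdir/selected.fasta"
```

Because the flag changes only on header lines, sequences of any number of lines are kept whole, which is what code (2) in the question was missing. Note that the match is on the entire header text after ">", so a header carrying a description after the ID, or an ID line with trailing whitespace or Windows line endings, would presumably not match without extra handling.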
