Extract each sequencing record into an individual file
Problem description
There is an ecoli.ffn file in which the header rows give the names of the sequenced genes:
$ head ecoli.ffn
>ecoli16:g027092:GCF_000460315:gi|545267691|ref|NZ_KE701669.1|:551259-572036
ATGAGCCTGATTATTGATGTTATTTCGCGT
AAAACATCCGTCAAACAAACGCTGATTAAT
>ecoli16:g000011:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
>ecoli16:g000012:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
CTGACAGCTGTTCTTACACTGGATTCAACC
CTGACAGCTGTTCTTACACTGGATTCAACC
As shown above, the gene name sits between the first and second colons of each header:
g027092
g000011
g000012
I would like to use ecoli.ffn to generate three files, g027092.txt, g000011.txt and g000012.txt, each containing the sequence data for one gene.
For example, g027092.txt will contain the raw data but without the header:
$ cat g027092.txt
ATGAGCCTGATTATTGATGTTATTTCGCGT
AAAACATCCGTCAAACAAACGCTGATTAAT
How can I do this?
Recommended answer
awk to the rescue!
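$ awk -F: -v RS=">" '
    # first file (index): one gene name per line; store each as a lookup key
    NR==FNR{n=split($0,t,"\n");
            for(i=1;i<n;i++) a[t[i]];
            next}
    # second file (file, e.g. ecoli.ffn): $2 is the gene name of each ">" record
    $2 in a{file=$2".txt";
            sub(/[^\n]+\n/,"");   # drop the header line
            print > file}' index file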
$ head g*.txt
==> g000011.txt <==
GTGTACGCTATGGCGGGTAATTTTGCCGAT
==> g000012.txt <==
GTGTACGCTATGGCGGGTAATTTTGCCGAT
CTGACAGCTGTTCTTACACTGGATTCAACC
CTGACAGCTGTTCTTACACTGGATTCAACC
==> g027092.txt <==
ATGAGCCTGATTATTGATGTTATTTCGCGT
AAAACATCCGTCAAACAAACGCTGATTAAT
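The command above expects a separate index file listing the wanted gene names. If every record in ecoli.ffn should be written out, as the question seems to ask, the index may not be needed. The following is only a sketch under that assumption: it reads the file line by line rather than with RS=">", the sample input is a shortened copy of the excerpt in the question, and the scratch directory is there purely for illustration.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)"

# Sample input: a shortened copy of the excerpt from the question.
cat > ecoli.ffn <<'EOF'
>ecoli16:g027092:GCF_000460315:gi|545267691|ref|NZ_KE701669.1|:551259-572036
ATGAGCCTGATTATTGATGTTATTTCGCGT
AAAACATCCGTCAAACAAACGCTGATTAAT
>ecoli16:g000011:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
EOF

# Header lines start with ">": field 2 (between the 1st and 2nd colon)
# becomes the output name. Every other line goes to the current output file.
awk -F: '
    /^>/ { if (out) close(out); out = $2 ".txt"; next }
    out  { print >> out }
' ecoli.ffn

head g*.txt
```

Closing each file before moving on keeps the number of open files at one, which may matter when the input holds many genes. Because the sketch appends with >>, rerunning it in the same directory would duplicate the sequences unless the old .txt files are removed first.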
Explanation
NR==FNR{n=sp...
This block parses the first file (index) and builds a lookup table of gene names.
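The NR==FNR test can be seen in isolation with a smaller case. This sketch uses the default record separator and two hypothetical files (index and candidates, both invented here for illustration) to show that the first block runs only for the first file:

```shell
cd "$(mktemp -d)"
printf 'g027092\ng000011\n' > index        # hypothetical list of wanted genes
printf 'g000011\ng000012\n' > candidates   # hypothetical names to check

# NR counts records across all inputs, FNR restarts at 1 for each file,
# so NR==FNR holds only while the first file is being read.
awk 'NR==FNR { a[$0]; next }    # first file: remember each name as a key
     { print $0, (($0 in a) ? "wanted" : "skipped") }' index candidates > lookup.out

cat lookup.out
```

Merely referencing a[$0] creates the key, so no value has to be assigned; the later "in" test only asks whether the key exists.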
$2 in a{file=$2".txt";
If the second field of the current record (the gene name) is in the lookup table, build the output file name from that key plus the .txt extension.
sub(/[^\n]+\n/,"")
Delete the header line, i.e. everything up to and including the first newline in the record.
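To see what that substitution does to a single record, here is a small sketch. The record text is a shortened, made-up stand-in for one entry of ecoli.ffn as awk would see it with RS=">":

```shell
# One record: header, newline, then two sequence lines.
result=$(printf 'ecoli16:g000011:55989\nGTGTAC\nCTGACA\n' |
    awk -v RS=">" '{ sub(/[^\n]+\n/, ""); printf "%s", $0 }')

echo "$result"
```

sub() replaces only the first match, so the sequence lines after the header are left untouched.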
print > file
Print what remains of the record to that file.
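One detail worth checking: with RS=">" each record normally still ends in the newline that preceded the next ">", and print adds another one, so the output files may end with an empty line. A possible variant, sketched here on a hypothetical two-record input (tiny.ffn, invented for this example), uses printf so that nothing extra is appended:

```shell
cd "$(mktemp -d)"
printf '>x:g1:y\nAAA\n>x:g2:y\nCCC\n' > tiny.ffn   # hypothetical input

# NF skips the empty record that precedes the first ">".
awk -F: -v RS=">" 'NF { file = $2 ".txt"
                        sub(/[^\n]+\n/, "")
                        printf "%s", $0 > file   # write the record as is
                        close(file) }' tiny.ffn

cat g1.txt g2.txt
```

close(file) keeps the number of open files low, at the cost that a gene name appearing twice would overwrite its earlier output; if that can happen, dropping the close() call is the safer choice.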