Sort FASTA by sequence size
Question
I currently want to sort a huge FASTA file (10^8+ lines and sequences) by sequence size. FASTA is a well-defined format used in biology to store sequences (nucleotide or protein):
>id1
sequence 1  # can span several lines
>id2
sequence 2
...
I have run a tool that gives me, in TSV format: the identifier, the length, and the position in bytes of the identifier.
For now, what I am doing is sorting this file by the length column, then parsing it and using seek to retrieve the corresponding sequence, which I append to a new file.
# this function will get the sequence using seek
def get_seq(file, bites):
    with open(file) as f_:
        f_.seek(bites, 0)  # go to the line of interest
        line = f_.readline().strip()  # this line is the beginning of the sequence
        to_return = ""  # init the string which will contain the sequence
        while line and not line.startswith('>'):  # while we are not at EOF
            # and have not encountered another identifier
            to_return += line
            line = f_.readline().strip()
    return to_return

# simply append the id and the sequence to a file
def write_seq(out_file, id_, sequence):
    with open(out_file, 'a') as out_f:
        out_f.write('>{}\n{}\n'.format(id_.strip(), sequence))
# main loop: parse the index file and call the functions defined above
with open(args.fai) as ref:
    for line in ref:
        spt = line.split()
        id_ = spt[0]
        seq = get_seq(args.i, int(spt[2]))
        write_seq(out_file=args.out, id_=id_, sequence=seq)
My problem is that this is really slow. Is that normal (it takes several days)? Is there another way to do it? I am not a pure computer scientist, so I may be missing something, but I believed that indexing the file and using seek was the fastest way to achieve this; am I wrong?
Answer
It seems like opening two files for each sequence is probably contributing a lot to the run time. You could pass file handles to your get/write functions rather than file names, but I would suggest using an established FASTA parser/indexer such as Biopython or samtools. Here's an (untested) solution with samtools:
import subprocess

# build the .fai index once
subprocess.call(["samtools", "faidx", args.i])

# open the output file once and append each record to it via stdout,
# instead of mixing shell redirection (">>") into the argument list
with open(args.fai) as ref, open(args.out, "a") as out:
    for line in ref:
        spt = line.split()
        id_ = spt[0]
        subprocess.call(["samtools", "faidx", args.i, id_], stdout=out)
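The "pass file handles instead of file names" suggestion can also be done in pure standard-library Python, with no external tools. Below is a minimal sketch (the function name sort_fasta_by_length and its structure are my own, not from the original post): it records a (length, byte offset) pair for every record in one pass, sorts that small index in memory, then copies the records out through a single pair of open handles. Binary mode is used so that seek offsets are exact on every platform.

```python
def sort_fasta_by_length(in_path, out_path):
    """Sort a FASTA file by ascending sequence length.

    Hypothetical helper: one indexing pass, one in-memory sort of
    (length, offset) tuples, one copy pass with a single open handle
    per file instead of a reopen per sequence.
    """
    index = []  # (sequence_length, offset_of_header_line)
    with open(in_path, "rb") as f:
        header_offset = None
        seq_len = 0
        offset = f.tell()
        line = f.readline()
        while line:
            if line.startswith(b">"):
                if header_offset is not None:
                    index.append((seq_len, header_offset))
                header_offset = offset  # remember where this record starts
                seq_len = 0
            else:
                seq_len += len(line.strip())
            offset = f.tell()
            line = f.readline()
        if header_offset is not None:  # flush the last record
            index.append((seq_len, header_offset))

    index.sort()  # ascending sequence length

    # second pass: seek to each record in sorted order and copy it verbatim
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for _, off in index:
            src.seek(off)
            dst.write(src.readline())  # header line
            line = src.readline()
            while line and not line.startswith(b">"):
                dst.write(line)
                line = src.readline()
```

Only the index (a few tuples per record) lives in memory, so this stays feasible for files whose sequences themselves would not fit in RAM, though for 10^8 records the index itself is sizable and samtools/Biopython remain the more battle-tested route.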