How to randomly extract FASTA sequences using Python?
Problem description
I have the following sequences in FASTA format, each with a header line followed by its nucleotides. How can I randomly extract sequences from the file? For example, I would like to randomly select 2 sequences out of the total. The tools I have found extract by percentage rather than by number of sequences. Can anyone help?
A.fasta
>chr1:1310706-1310726
GACGGTTTCCGGTTAGTGGAA
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG
>chr1:983001-983021
GTCCGCTTGCGGGACCTGGGG
>chr1:984333-984353
CTGGAATTCCGGGCGCTGGAG
>chr1:1154147-1154167
GAGATCGTCCGGGACCTGGGT
Expected output
>chr1:1154147-1154167
GAGATCGTCCGGGACCTGGGT
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG
Recommended answer
If you are working with FASTA files, use BioPython. To get n sequences, use random.sample:
from Bio import SeqIO
from random import sample

with open("foo.fasta") as f:
    seqs = SeqIO.parse(f, "fasta")
    # SeqIO.parse returns an iterator, so build a list before sampling
    print(sample(list(seqs), 2))
Output:
[SeqRecord(seq=Seq('GAGATCGTCCGGGACCTGGGT', SingleLetterAlphabet()), id='chr1:1154147-1154167', name='chr1:1154147-1154167', description='chr1:1154147-1154167', dbxrefs=[]), SeqRecord(seq=Seq('GTCCGCTTGCGGGACCTGGGG', SingleLetterAlphabet()), id='chr1:983001-983021', name='chr1:983001-983021', description='chr1:983001-983021', dbxrefs=[])]
If necessary, you can extract the names and sequence strings instead, replacing the print call inside the with block:
print([(seq.name,str(seq.seq)) for seq in sample(list(seqs),2)])
[('chr1:1310706-1310726', 'GACGGTTTCCGGTTAGTGGAA'), ('chr1:983001-983021', 'GTCCGCTTGCGGGACCTGGGG')]
If the lines always come in pairs (one header line, one sequence line) and you skip any metadata at the top of the file, you could zip the file iterator with itself:
from random import sample

with open("foo.fasta") as f:
    print(sample(list(zip(f, f)), 2))
This gives you pairs of lines as tuples:
[('>chr1:983001-983021\n', 'GTCCGCTTGCGGGACCTGGGG\n'), ('>chr1:984333-984353\n', 'CTGGAATTCCGGGCGCTGGAG\n')]
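The zip approach above only works when every record is exactly two lines. As a rough sketch for files whose sequences wrap over several lines, the records can be grouped by header first and then sampled, using only the standard library. The helper names `read_fasta` and `sample_fasta` are hypothetical and not part of the original answer, and the demo data reuses headers from the question with one sequence split across two lines for illustration.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # a sequence may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


def sample_fasta(lines, n, rng=random):
    """Return n records chosen at random, without replacement."""
    records = list(read_fasta(lines))
    return rng.sample(records, n)


if __name__ == "__main__":
    demo = [
        ">chr1:1310706-1310726\n", "GACGGTTTCC\n", "GGTTAGTGGAA\n",
        ">chr1:901959-901979\n", "GAGGGCTTTCTGGAGAAGGAG\n",
        ">chr1:983001-983021\n", "GTCCGCTTGCGGGACCTGGGG\n",
    ]
    for name, seq in sample_fasta(demo, 2):
        print(">{}\n{}".format(name, seq))
```

An open file object can be passed as `lines`. Like `random.sample` itself, this raises `ValueError` if n is larger than the number of records.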
To get the lines ready to be written:
from Bio import SeqIO
from random import sample

with open("foo.fasta") as f:
    seqs = SeqIO.parse(f, "fasta")
    samps = ((seq.name, seq.seq) for seq in sample(list(seqs), 2))
    for samp in samps:
        print(">{}\n{}".format(*samp))
Output:
>chr1:1310706-1310726
GACGGTTTCCGGTTAGTGGAA
>chr1:983001-983021
GTCCGCTTGCGGGACCTGGGG
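The snippets above call list(seqs), which loads every record into memory before sampling. For very large files, one alternative is reservoir sampling, a different technique from the one used in the answer, which keeps only n records at a time while reading the input once. The following is a minimal sketch; the function name `reservoir_sample` is hypothetical.

```python
import random


def reservoir_sample(records, n, rng=random):
    """Pick n items uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, record in enumerate(records):
        if i < n:
            reservoir.append(record)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < n:
                reservoir[j] = record  # replace an earlier pick
    return reservoir


if __name__ == "__main__":
    # Any iterator works; range() stands in for a stream of records here.
    print(reservoir_sample(iter(range(1000)), 2))
```

With BioPython installed, the iterator returned by SeqIO.parse(f, "fasta") could be passed in directly as `records`. If the input holds fewer than n records, the sketch returns all of them rather than raising an error.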