How to extract short sequences using a window with a specific step size?
Question
The code below extracts short sequences from every sequence using a window of size 4. How can I shift the window by a step size of 2 and extract 4 base pairs each time?
Sample code
from Bio import SeqIO

with open("testA_out.fasta", "w") as f:
    for seq_record in SeqIO.parse("testA.fasta", "fasta"):
        i = 0
        # Slide a 4-base window along the sequence, moving 2 bases at a time.
        while (i + 4) < len(seq_record.seq):
            f.write(">" + str(seq_record.id) + "\n")
            f.write(str(seq_record.seq[i:i+4]) + "\n")
            i += 2
Example input (testA.fasta)
>human1
ACCCGATTT
Example output (testA_out.fasta)
>human1
ACCC
>human1
CCGA
>human1
GATT
The problem with this output is that one T is left out, and I would like to include it as well. How can I get the output below? I would also like a reverse extraction (from end to start) to pick up base pairs that are missed when extracting from start to end. Can anyone help?
Expected output
>human1
ACCC
>human1
CCGA
>human1
GATT
>human1
ATTT
>human1
CGAT
>human1
CCCG
Recommended answer
You can use a for loop with range, passing the step as range's third argument. This is a bit cleaner than using a while loop. If fewer than chunk characters remain at the last start position, the last chunk will be shorter.
data = "ACCCGATTT"
step = 2
chunk = 4
# Start positions 0, 2, 4, 6; the slice at 6 runs past the end and is truncated.
for i in range(0, len(data) - step, step):
    print(data[i:i+chunk])
The output is:
ACCC
CCGA
GATT
TTT
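The loop above ends with a shortened TTT chunk rather than the full-length windows listed in the question's expected output. One possible way to get that output is sketched below: a forward pass that keeps only full-length windows, followed by a reverse pass anchored at the end of the sequence. The helper names forward_windows and reverse_windows and the hard-coded record id are made up for this illustration; they are not part of Biopython or of the original answer.

```python
def forward_windows(seq, size=4, step=2):
    """Full-length windows taken left to right, starting at index 0."""
    # Stop at len(seq) - size so every slice has exactly `size` characters.
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]


def reverse_windows(seq, size=4, step=2):
    """Full-length windows taken right to left, anchored at the end."""
    # `end` is the exclusive end index; it must stay >= size so the start is >= 0.
    return [seq[end - size:end] for end in range(len(seq), size - 1, -step)]


data = "ACCCGATTT"
record_id = "human1"  # hypothetical id, taken from the example input

# Print FASTA-style records: forward windows first, then reverse windows.
for window in forward_windows(data) + reverse_windows(data):
    print(">" + record_id)
    print(window)
```

For ACCCGATTT this yields ACCC, CCGA, GATT from the forward pass and ATTT, CGAT, CCCG from the reverse pass. Note that when the forward pass already ends exactly at the last base (that is, when the sequence length minus the window size is a multiple of the step), the reverse pass returns the same windows again, so you may want to drop duplicates in that case.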