How to extract short sequences using a window with a specific step size?


Question

The code below extracts short sequences from every sequence using a window of size 4. How can I shift the window by a step size of 2 and extract 4 base pairs each time?

Example code

from Bio import SeqIO

with open("testA_out.fasta", "w") as f:
    for seq_record in SeqIO.parse("testA.fasta", "fasta"):
        i = 0
        while (i + 4) < len(seq_record.seq):
            f.write(">" + str(seq_record.id) + "\n")
            f.write(str(seq_record.seq[i:i+4]) + "\n")
            i += 2

Example input (testA.fasta)

>human1
ACCCGATTT

Example output (testA_out.fasta)

>human1
ACCC
>human1
CCGA
>human1
GATT

The problem with this output is that one T is left out, and I would like to include it as well. I would also like a reverse extraction, working from the end of the sequence back toward the start, to pick up the base pairs that are left out when extracting from start to end. How can I produce the output below? Can anyone help?


Expected output

>human1
ACCC
>human1
CCGA
>human1
GATT
>human1
ATTT
>human1
CGAT
>human1
CCCG

Recommended answer

You can use a for loop over range, passing the step as range's third argument. This is a bit cleaner than a while loop. If the windows do not fit the sequence length exactly, the last chunk will be shorter than the window size.

data = "ACCCGATTT"
step = 2
chunk = 4
for i in range(0, len(data) - step, step):
    print(data[i:i+chunk])

The output is:

ACCC
CCGA
GATT
TTT
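
The loop above ends with a short chunk (TTT) rather than the full-length ATTT, CGAT and CCCG windows listed in the question's expected output. A minimal sketch of one way to get that exact list, assuming the "reverse extract" means full-size windows anchored at the end of the sequence and stepped backwards; the function name sliding_windows and its parameters are made up for illustration and are not part of Biopython:

```python
def sliding_windows(seq, size=4, step=2):
    """Return full-length windows taken forward from the start,
    followed by full-length windows taken backward from the end."""
    n = len(seq)
    # Forward pass: window starts at 0, step, 2*step, ... while a full window fits.
    forward = [seq[i:i + size] for i in range(0, n - size + 1, step)]
    # Reverse pass: window ends at n, n-step, ... while a full window fits.
    reverse = [seq[end - size:end] for end in range(n, size - 1, -step)]
    return forward + reverse


for window in sliding_windows("ACCCGATTT"):
    print(window)
```

For "ACCCGATTT" this prints ACCC, CCGA, GATT, ATTT, CGAT, CCCG, one per line. When the forward windows already reach the end of the sequence exactly, the reverse pass repeats them, so duplicates may need to be filtered out. In the question's script, each returned window could be written out with the same two f.write calls, passing str(seq_record.seq) as seq.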
