How would I go about parsing a text file of thousands of DNA bases?

Problem description

I have a massive text file containing DNA bases (A, T, C, G). I would like to take every 60 characters (an arbitrary number) and put them on a new line, so that the bases are separated into chunks. I would also like each chunk to overlap the previous one by a certain number of bases. For example, given the 10-letter sequence ATGGCTGCTA with a chunk size of 4 and an overlap of 2, the first chunk would be ATGG, the next GGCT, then CTGC, and so on. I know I'll probably have to look into opening, reading, and writing text files with Python. If anyone has resources they could point me towards, or any tips and instructions, that would be great.

Example of the format of the text I would be working with:

https://www.ncbi.nlm.nih.gov/nuccore/NC_000017.11?report=fasta&from=7661779&to=7687550

Recommended answer

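data = 'GAGACAGAGTCTCACTCTGTTGCACAGGCTGGAGTGCAGTGGCACAATCTCTGCTCACTGCAACCTCCTC'
chunk_size = 5  # number of bases per chunk
overlap = 2     # number of bases shared by consecutive chunks

# Advance by chunk_size - overlap so that each chunk repeats the last
# `overlap` bases of the previous one; chunks near the end may be shorter.
for pos in range(0, len(data), chunk_size - overlap):
    print(data[pos:pos+chunk_size])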

Result:

GAGAC
ACAGA
GAGTC
TCTCA
CACTC
TCTGT
...
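
The snippet above works on a string that is already in memory, while the question also asks about reading and writing files. The following is a minimal sketch of one way to combine the two, not a definitive implementation. It assumes the input is a single-record FASTA file (one header line starting with `>` followed by lines of bases) that fits in memory. The function names `read_fasta_sequence`, `overlapping_chunks` and `write_chunks`, and the file names `sequence.fasta` and `chunks.txt`, are hypothetical and chosen only for illustration. Unlike the loop above, this version stops as soon as a chunk reaches the end of the sequence, so it does not emit extra short fragments.

```python
def read_fasta_sequence(path):
    """Return the bases from a single-record FASTA file as one string.

    Assumption: header lines start with '>' and every other non-empty
    line contains only bases.
    """
    parts = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith(">"):
                parts.append(line)
    return "".join(parts)


def overlapping_chunks(seq, chunk_size, overlap):
    """Yield chunks of seq that share `overlap` bases with the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    for pos in range(0, len(seq), step):
        yield seq[pos:pos + chunk_size]
        if pos + chunk_size >= len(seq):
            break  # this chunk already reached the end of the sequence


def write_chunks(seq, out_path, chunk_size=60, overlap=10):
    """Write one chunk per line to out_path."""
    with open(out_path, "w") as out:
        for chunk in overlapping_chunks(seq, chunk_size, overlap):
            out.write(chunk + "\n")


# The example from the question: chunk size 4, overlap 2.
print(list(overlapping_chunks("ATGGCTGCTA", 4, 2)))
# → ['ATGG', 'GGCT', 'CTGC', 'GCTA']

# Hypothetical file usage (file names are placeholders):
# write_chunks(read_fasta_sequence("sequence.fasta"), "chunks.txt", 60, 10)
```

The final chunk can still be shorter than `chunk_size` when the sequence length does not line up with the step. For inputs too large to hold in memory, the file would need to be read in blocks instead, which this sketch does not attempt.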
