Counting repeated STRs in DNA (CS50 PSET6)

Question

I am currently working on CS50. I am trying to count STRs in the DNA sequence files, but my code always overcounts.

For example: how many times does 'AGATC' repeat consecutively in the DNA file?

This code is only an attempt to work out how to count those repeats accurately.

import csv
import re
from sys import argv, exit

def main():
    if len(argv) != 3:
        print("Usage: python dna.py data.csv sequence.txt")
        exit(1)

    with open(argv[1]) as csv_file, open(argv[2]) as dna_file:
        reader = csv.reader(csv_file)
        #for row in reader:
        #    print(row)

        str_sequences = next(reader)[1:]

        dna = dna_file.read()
        for i in range(len(dna)):
            count = len(re.findall(str_sequences[0], dna))   # str_sequences[0] is 'AGATC'
        print(count)

main()

Result for DNA file 11 (AGATC):

$ python dna.py databases/large.csv sequences/11.txt
52

The result is supposed to be 43. For small.csv the count is accurate, but for large.csv it always overcounts. I later realized that my code counts every occurrence of AGATC in the DNA file. The task, however, is to count only the copies that repeat consecutively and to ignore the same STR if it shows up again later.

{AGATCAGATCAGATCAGATC(T)TTTTAGATC}

So, how do I stop counting once the DNA hits the (T), so that the AGATC that comes after it is not counted? What should I change in my code, especially in the re.findall() call? Some people said to use substrings; how would that work? Or can I just use a regex like I did?

Please write your code if you can. Sorry for my bad English.

Recommended answer

The for loop is wrong: it recomputes the same count on every iteration, and re.findall with the bare STR counts every occurrence in the file, not only the consecutive ones. I think you want to loop over str_sequences instead.

Something like:

seq_list = []

for STR in str_sequences:
    groups = re.findall(rf'(?:{STR})+', dna)
    if len(groups) == 0:
        seq_list.append('0')
    else:
        seq_list.append(str(max(map(lambda x: len(x)//len(STR), groups))))

print(seq_list)
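
To see why grouping the repeats changes the count, here is a minimal self-contained sketch of the same idea applied to the example sequence from the question. The helper name longest_run is made up for illustration, and re.escape is added only as a precaution in case an STR ever contained regex metacharacters.

```python
import re


def longest_run(str_unit, dna):
    """Return the largest number of back-to-back repeats of str_unit in dna."""
    # (?:...)+ matches one or more consecutive copies of the STR.
    pattern = rf"(?:{re.escape(str_unit)})+"
    runs = re.findall(pattern, dna)
    # Each match is one whole run; its length divided by the STR length
    # is the repeat count. default=0 covers the case of no match at all.
    return max((len(run) // len(str_unit) for run in runs), default=0)


# The sequence from the question: four consecutive AGATC, then a stray one.
dna = "AGATC" * 4 + "TTTTT" + "AGATC"
print(len(re.findall("AGATC", dna)))  # every occurrence: 5
print(longest_run("AGATC", dna))      # longest consecutive run: 4
```

The first print reproduces the overcount described in the question; the second counts only the longest uninterrupted run.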

There are also many posts about this problem. You may want to look at some of them to finish your program.
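
The question also asks how a substring approach would work. As a rough sketch (not taken from the original answer; the function name longest_run_by_slicing is hypothetical), one option is to try every start position and keep stepping forward by the STR length for as long as the next slice still equals the STR:

```python
def longest_run_by_slicing(str_unit, dna):
    """Longest run of consecutive str_unit copies, using only slicing."""
    if not str_unit:
        # An empty STR would never advance the scan, so treat it as no match.
        return 0
    step = len(str_unit)
    best = 0
    for start in range(len(dna)):
        count = 0
        pos = start
        # Keep stepping forward while the next slice equals the STR.
        while dna[pos:pos + step] == str_unit:
            count += 1
            pos += step
        best = max(best, count)
    return best


print(longest_run_by_slicing("AGATC", "AGATC" * 4 + "TTTTT" + "AGATC"))  # 4
```

The scan stops as soon as a slice differs from the STR, which is exactly the "stop at the (T)" behaviour asked about. Because every start position is checked, the result does not depend on how non-overlapping regex matches are consumed, at the cost of doing more work than the regex version.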
