Unable to parse just sequences from FASTA file

Views: 310  Published: 2020/9/21 3:16:34  Tags: python bioinformatics fasta


Problem description

How can I remove IDs like '>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n' from sequences?

I have this code:

with open('sequence.fasta', 'r') as f :
    while True:
        line1=f.readline()
        line2=f.readline()
        line3=f.readline()
        if not line3:
            break
        fct([line1[i:i+100] for i in range(0, len(line1), 100)])
        fct([line2[i:i+100] for i in range(0, len(line2), 100)])
        fct([line3[i:i+100] for i in range(0, len(line3), 100)])

Output:

['>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n']
['CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG\n']
['AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG\n']
['CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA\n']
['AGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGA\n']
['ATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGAT\n']
['AAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCA\n']
['GGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCC\n']
['AGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGT\n']
['TTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTT\n']
['GTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGAT\n']
['GTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC\n']
['\n']
...

My function is:

def fct(input_string):
    code={"a":0,"c":1,"g":2,"t":3}
    p=[code[i] for i in input_string]
    n=len(input_string)
    c=0

    for i, n in enumerate(range(n, 0, -1)):
        c +=p[i]*(4**(n-1))
        return c+1

fct() returns an integer from a string. For example, ACT gives 8. That is, my function must take as input string sequences that contain only the bases A, C, G, T.

But when I use my function, it gives:

KeyError: '>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n'

I tried to remove the IDs by stripping the lines that start with > and writing the rest to a text file, so my text file output.txt contains just the sequences without IDs. But when I use my function fct on it, I get the same kind of error:

KeyError: 'CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG\n'

What should I do?

Recommended answer

Iterating over sequence data

In your code, a list of strings is passed to fct(), so input_string is not actually a string but a list of strings. Iterating over that list yields each whole line (including its trailing "\n") rather than single bases, which is why the entire line shows up as the missing key in the KeyError. The solution is to build one input string per sequence and iterate over that.
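Building one input string per record can be sketched without any third-party library. This is a minimal sketch, assuming a plain-text FASTA file in which every header line starts with ">"; the function name read_fasta_sequences and the example headers are hypothetical and not part of the original question.

```python
def read_fasta_sequences(lines):
    """Yield one concatenated sequence string per FASTA record.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    chunks = []
    for line in lines:
        line = line.strip()            # drop the trailing newline
        if not line:
            continue                   # skip blank lines between records
        if line.startswith(">"):       # header line: a new record begins
            if chunks:
                yield "".join(chunks)
                chunks = []
        else:
            chunks.append(line.upper())
    if chunks:                         # flush the last record
        yield "".join(chunks)


# Hypothetical in-memory example; an open file works the same way.
example = [
    ">record_1 hypothetical header\n",
    "CGTAAC\n",
    "AAGG\n",
    "\n",
    ">record_2 hypothetical header\n",
    "ACT\n",
]
for seq in read_fasta_sequences(example):
    print(seq)
```

With a real file, the same generator can be fed from `with open('sequence.fasta') as f:`, and each yielded string contains only the bases, with no ID line and no newline characters.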

You need to capitalize the keys of your dictionary: case matters, and the sequences in the file are uppercase.
The return statement should come after the for loop. Keeping it nested inside the loop means c is returned on the first iteration.
Why bother constructing p when you can just index into code while iterating over the sequence?
You overwrite the sequence's length (n) by reusing n as a variable name in your for loop.
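The effect of the misplaced return can be seen in a small side-by-side comparison. This is only an illustrative sketch: the two function names are hypothetical, and both use the uppercase dictionary keys suggested above.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def score_return_inside(seq):
    c = 0
    for i, k in enumerate(range(len(seq), 0, -1)):
        c += CODE[seq[i]] * 4 ** (k - 1)
        return c + 1  # bug: leaves the function during the first iteration


def score_return_after(seq):
    c = 0
    for i, k in enumerate(range(len(seq), 0, -1)):
        c += CODE[seq[i]] * 4 ** (k - 1)
    return c + 1      # runs once, after every base has been processed


print(score_return_inside("ACT"))  # 1: only the leading "A" was counted
print(score_return_after("ACT"))   # 8: the value the question expects
```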

Modified code (with proper PEP 8 formatting), with variables renamed to make their meaning clearer (I still have no idea what c is supposed to be):

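from Bio import SeqIO  # provided by Biopython


def dna_seq_score(dna_seq):
    # Keys are uppercase to match the bases as they appear in the FASTA file.
    nucleotide_code = {"A": 0, "C": 1, "G": 2, "T": 3}

    c = 0
    # i walks the sequence left to right; k - 1 is the matching power of 4.
    for i, k in enumerate(range(len(dna_seq), 0, -1)):
        nucleotide = dna_seq[i]
        code_num = nucleotide_code[nucleotide]
        c += code_num * (4 ** (k - 1))
    return c + 1  # returned once, after the whole sequence has been processed


# SeqIO.parse yields one record per FASTA entry; record.seq excludes the ">" header.
for record in SeqIO.parse("test.fasta", "fasta"):
    print(record.id, dna_seq_score(record.seq))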
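As for what c might represent: judging only from the arithmetic, the loop appears to read the sequence as a base-4 number (A=0, C=1, G=2, T=3) and then add 1. That interpretation is an assumption, not something the question states. Under it, the same value can be computed with Horner's rule, which avoids the explicit powers of 4. The name base4_score is hypothetical.

```python
NUCLEOTIDE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def base4_score(dna_seq):
    """Read dna_seq as a base-4 number and add 1 (assumed meaning of c + 1)."""
    value = 0
    for nucleotide in str(dna_seq).upper():
        # Raises KeyError for anything other than A, C, G, T (e.g. "N" or "\n").
        value = value * 4 + NUCLEOTIDE_CODE[nucleotide]
    return value + 1


print(base4_score("ACT"))  # 8, matching the example in the question
```

Because the input is converted with str(), the function accepts either a plain string or a Biopython Seq object. Sequences containing ambiguity codes such as N would still raise KeyError and need separate handling.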
