Reading a FASTA file in Python


Problem description

I am reading a FASTA file that has a format like this:


>gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKIIRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGF 

I have to read the file and then calculate the JC (Jukes-Cantor) distance. For a pair of sequences, the JC distance is -3/4 * ln(1 - 4/3 * p), where p is the proportion of sites that differ between the pair.

I have set up the skeleton of it but am unsure how to do the rest. After reading the file and calculating the Jukes-Cantor distance, I have to write the result to a new output file, and it should be in a table. Any help I can get is much appreciated! Thanks. I am new to both Python and FASTA files.

def readData():
    filename = input("Enter the name of the FASTA file: ")
    with open(filename, "r") as infile:
        seqs = []  # temporary: parsing is not written yet
    return seqs

def calculateJC(x, y):
    if x == y:
        return 0
    else:
        return 1 # temporary*

def calcDists(seqs):
    output = []
    for seq1 in seqs:
        newrow = []
        for seq2 in seqs:
            dist = calculateJC(seq1, seq2)
            newrow.append(dist)
        output.append(newrow)
    return output


def outputDists(distMat):
    pass

def main():
    seqs = readData()
    distMat = calcDists(seqs)
    outputDists(distMat)


if __name__ == "__main__":
    main()

Recommended answer

You are asking too many questions at a time! Focus on one.

Reading and writing FASTA files is covered in BioPython (as suggested in comments).
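
With Biopython installed, `SeqIO.parse(path, "fasta")` from the `Bio` package yields one record per entry. If adding a dependency is not an option, a hand-rolled reader along the following lines may be enough for simple files. This is only a sketch: the function name `read_fasta` and the choice to return `(header, sequence)` tuples are assumptions made for illustration, not part of the original question or answer.

```python
def read_fasta(path):
    """Return a list of (header, sequence) tuples read from a FASTA file."""
    records = []
    header, chunks = None, []
    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            if line.startswith(">"):
                # A new header closes the previous record, if there is one.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)  # a sequence may span several lines
    # Flush the last record once the file is exhausted.
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Joining the chunks only when a record is complete keeps multi-line sequences intact, which matters because many FASTA files wrap sequences at a fixed width.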

I noticed that you aren't calculating your JC distance yet, so perhaps this is where you need help. Here is what I came up with:

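import math

def computeJC(seq1, seq2):
    # Count the sites at which the two sequences differ.
    diff = 0
    for base1, base2 in zip(seq1, seq2):
        diff += (base1 != base2)
    # p is the proportion of differing sites.
    p = diff / float(len(seq1))
    # Float literals avoid integer division under Python 2.
    return -3.0 / 4.0 * math.log(1 - 4.0 / 3.0 * p)

Iterating over two sequences in parallel is explained in "How can I iterate through two lists in parallel". In Python 3 the built-in zip does this lazily; Python 2 code used itertools.izip for the same purpose. This piece of code works with any kind of string, and the loop stops when either seq1 or seq2 reaches the end.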
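
Two details of the formula are easy to get wrong, so a guarded variant may be worth sketching. Here p is taken as the proportion of sites that differ (as the question defines it), and the logarithm is undefined once p reaches 0.75. The name `jc_distance`, the choice to return infinity at saturation, and the requirement of equal-length aligned input are assumptions made for this sketch.

```python
import math

def jc_distance(seq1, seq2):
    """Jukes-Cantor distance between two aligned, equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    # p is the proportion of sites that differ between the pair.
    p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
    if p == 0:
        return 0.0  # identical sequences; avoids returning -0.0
    if p >= 0.75:
        return float("inf")  # assumption: report saturation as infinity
    return -0.75 * math.log(1 - 4.0 / 3.0 * p)

# One differing site out of ten, so p = 0.1.
print(jc_distance("ACGTACGTAC", "ACGTACGTAA"))
```

With p = 0.1 the formula gives -3/4 * ln(1 - 0.1333...), which is roughly 0.107, slightly larger than p itself because the correction accounts for multiple substitutions at the same site.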

Someone else may come up with a "Pythonic one-liner", but try to understand my approach first. It avoids the pitfalls that your code fell into: nested loops, unnecessary branching, growing lists at runtime, and spaghetti code, to name a few. Enjoy!
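
For the table the question asks for, one possible shape is a tab-separated matrix with sequence names as row and column labels. The sketch below uses the compact `sum(...)` form of the difference count. The names `jc` and `distance_table`, the four-decimal formatting, and the sample sequences are all illustrative assumptions, and the helper assumes equal-length sequences that differ at fewer than 75% of sites.

```python
import math

def jc(seq1, seq2):
    # Compact form: proportion of differing sites, then the JC formula.
    # Assumes equal-length input with p < 0.75 (math.log raises otherwise).
    p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
    return 0.0 if p == 0 else -0.75 * math.log(1 - 4.0 / 3.0 * p)

def distance_table(names, seqs):
    """Return all pairwise distances as tab-separated text, one row per sequence."""
    lines = ["\t".join([""] + list(names))]  # header row, empty corner cell
    for name, s1 in zip(names, seqs):
        row = ["%.4f" % jc(s1, s2) for s2 in seqs]
        lines.append("\t".join([name] + row))
    return "\n".join(lines) + "\n"

names = ["a", "b"]
seqs = ["ACGTACGTAC", "ACGTACGTAA"]
table = distance_table(names, seqs)
print(table, end="")
# To save the table, write the same string to a file opened with open(path, "w").
```

Building the whole table as a string first keeps the formatting separate from file handling, so the same function can feed either the screen or an output file.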
