使用biopython将dna对齐转换为numpy数组 [英] Transform dna alignment into numpy array using biopython

查看:92
本文介绍了使用biopython将dna对齐转换为numpy数组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有几个已比对的DNA序列,我只想在特定位置保留可变碱基.

I have several DNA sequences that have been aligned and I would like to keep only the bases that are variable at a specific position.

如果我们首先将对齐方式转换为数组,则可以完成此操作.我尝试使用Biopython教程中的代码,但出现错误.

This maybe could be done if we first transform the alignment into an array. I tried using the code in the Biopython tutorial but it gives an error.

import numpy as np
from Bio import AlignIO
alignment = AlignIO.parse("ma-all-mito.fa", "fasta")
align_array = np.array([list(rec) for rec in alignment], np.character)
print("Array shape %i by %i" % align_array.shape)

我得到的错误:

Traceback (most recent call last):

File "C:/select-snps.py", line 8, in <module>
    print("Array shape %i by %i" % align_array.shape)
TypeError: not all arguments converted during string formatting

推荐答案

我是在回答您的问题,而不是修复代码.如果只想保留某些职位,请使用AlignIO:

I'm answering to your problem instead of fixing your code. If you want to keep only certain positions, you want to use AlignIO:

FASTA示例al.fas:

>seq1
CATCGATCAGCATCGACATGCGGCA-ACG
>seq2
CATCGATCAG---CGACATGCGGCATACG
>seq3
CATC-ATCAGCATCGACATGCGGCATACG
>seq4
CATCGATCAGCATCGACAAACGGCATACG

现在假设您只想保留某些职位. MultipleSeqAlignment 允许您查询对齐方式像一个numpy数组:

Now suppose you want to keep only certain positions. MultipleSeqAlignment allows you to query the alignment like a numpy array:

from Bio import AlignIO


al = AlignIO.read("al.fas", "fasta")

# Print the 11th column
print(al[:, 10])

# Print the 12-15 columns
print(al[:, 11:14])

如果您想知道路线的形状,请使用lenget_alignment_length:

If you want to know the shape of the alignment, use len and get_alignment_length:

>>> print(len(al), al.get_alignment_length())
4 29


当您使用AlignIO.parse()加载对齐方式时,它假定要解析的文件可能包含多个对齐方式(PHYLIP会这样做).因此,解析器在每次对齐时都返回一个迭代器,而不是代码所暗示的那样在记录上返回.但是您的FASTA文件每个文件仅包含一个对齐方式,并且parse()仅产生一个MultipleSeqAlignment.因此,对您的代码的修复是:


When you use AlignIO.parse() to load an alignment, it assumes the file to be parsed could contain more than one alignment (PHYLIP does this). Thus the parser returns an iterator over each alignment and not over records as your code implies. But your FASTA file only contain one alignment per file and parse() yields only one MultipleSeqAlignment. So the fix to your code is:

alignment = AlignIO.read("ma-all-mito.fa", "fasta")
align_array = np.array(alignment, np.character)
print("Array shape %i by %i" % align_array.shape)

这篇关于使用biopython将dna对齐转换为numpy数组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆