Use Python to extract Branch Lengths from Newick Format

Views: 274 Published: 2020/7/3 18:52:08 Tags: python regex dna-sequence phylogeny


Question

I have a list in Python consisting of one item, which is a tree written in Newick format, as below:

['(BMNH833953:0.16529463651919140688,(((BMNH833883:0.22945757727367316336,(BMNH724182a:0.18028180766761139897,(BMNH724182b:0.21469677818346077913,BMNH724082:0.54350916483644962085):0.00654573856803835914):0.04530853441176059537):0.02416511342888815264,(((BMNH794142:0.21236619242575086042,(BMNH743008:0.13421900772403019819,BMNH724591:0.14957653992840658219):0.02592135486124686958):0.02477670174791116522,BMNH703458a:0.22983459269245612444):0.00000328449424529074,BMNH703458b:0.29776257618061197086):0.09881729077887969892):0.02257522897558370684,BMNH833928:0.21599133163597591945):0.02365043128986757739,BMNH724053:0.16069861523756587274):0.0;']

Drawn as a tree, it looks like this (figure not reproduced here):

I am trying to write some code that looks through the list item and returns the IDs (BMNHxxxxxx) that are joined by a branch length of 0 (or, for example, less than 0.001); these are highlighted in red in the tree. I thought about using a regex such as:

JustTree = []
with JustTree as f:
    for match in re.finditer(r"(?<=Item\sA)(?:(?!Item\sB).){50,}", subject, re.I):
        f.extend(match.group()+"\n")

This is taken from another StackOverflow answer. Here item A would be ':' because the branch lengths always appear after a ':', and item B would be ',', ')' or ';' because these are the three characters that delimit a branch length. However, I'm not experienced enough in regex to do this.

Using a branch length of 0 in this case, I want the code to output ['BMNH703458a', 'BMNH703458b']. It would be very useful if I could also change this to include IDs joined by a branch length below a user-defined value, say 0.01.

If anyone has any input, or can point me to a useful answer, I would highly appreciate it.

Recommended answer

Okay, here's a regex to extract only numbers (with an optional decimal part):

\b[0-9]+(?:\.[0-9]+)?\b

The \b anchors make sure there is no other digit, letter, or underscore directly adjacent to the number. \b is called a word boundary.

[0-9]+ matches one or more digits.

(?:\.[0-9]+)? is an optional non-capturing group, meaning it may or may not match. If a dot and digits follow the first [0-9]+, the group matches them; otherwise it matches nothing. The group itself matches a dot followed by at least one digit.
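To see why the word boundaries matter, here is a small sketch on made-up input. The sample strings and the second pattern (after_colon, which ties the number to a preceding ':' and a following ',', ')' or ';', along the lines the question suggests) are illustrative assumptions, not part of the original answer.

```python
import re

# The pattern from the answer: a number with an optional decimal part,
# surrounded by word boundaries.
word_boundary = re.compile(r"\b[0-9]+(?:\.[0-9]+)?\b")

# Hypothetical alternative: only accept a number that directly follows ':'
# and is directly followed by one of the Newick delimiters , ) ;
after_colon = re.compile(r"(?<=:)[0-9]+(?:\.[0-9]+)?(?=[,);])")

# Digits inside IDs such as BMNH724182a are skipped, because there is no
# word boundary between a letter and a digit.
sample = "(BMNH724182a:0.18,BMNH833953:0.2):0.0;"
print(word_boundary.findall(sample))  # ['0.18', '0.2', '0.0']
print(after_colon.findall(sample))    # ['0.18', '0.2', '0.0']

# A made-up label containing a dot shows where the two patterns differ:
# '.' is not a word character, so \b lets the '12' in 'sp.12' through.
tricky = "(sp.12:0.3,B:0.1);"
print(word_boundary.findall(tricky))  # ['12', '0.3', '0.1']
print(after_colon.findall(tricky))    # ['0.3', '0.1']
```

For the IDs in the question (letters followed by digits, with no dots), both patterns return the same branch lengths.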

You can use it with re.findall to put all the matches in a list:

import re
NewickTree = ['(BMNH833953:0.16529463651919140688,(((BMNH833883:0.22945757727367316336,(BMNH724182a:0.18028180766761139897,(BMNH724182b:0.21469677818346077913,BMNH724082:0.54350916483644962085):0.00654573856803835914):0.04530853441176059537):0.02416511342888815264,(((BMNH794142:0.21236619242575086042,(BMNH743008:0.13421900772403019819,BMNH724591:0.14957653992840658219):0.02592135486124686958):0.02477670174791116522,BMNH703458a:0.22983459269245612444):0.00000328449424529074,BMNH703458b:0.29776257618061197086):0.09881729077887969892):0.02257522897558370684,BMNH833928:0.21599133163597591945):0.02365043128986757739,BMNH724053:0.16069861523756587274):0.0;']

pattern = re.compile(r"\b[0-9]+(?:\.[0-9]+)?\b")

for tree in NewickTree:
    branch_lengths = pattern.findall(tree)
    # Do stuff to the list branch_lengths
    print(branch_lengths)

For this list, the following is printed:

['0.16529463651919140688', '0.22945757727367316336', '0.18028180766761139897',
 '0.21469677818346077913', '0.54350916483644962085', '0.00654573856803835914', 
 '0.04530853441176059537', '0.02416511342888815264', '0.21236619242575086042',
 '0.13421900772403019819', '0.14957653992840658219', '0.02592135486124686958', 
 '0.02477670174791116522', '0.22983459269245612444', '0.00000328449424529074',
 '0.29776257618061197086', '0.09881729077887969892', '0.02257522897558370684',
 '0.21599133163597591945', '0.02365043128986757739', '0.16069861523756587274',
 '0.0']
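The list above contains the branch lengths only, without the IDs they belong to. Getting from there to the output the question asks for needs the tree structure, so what follows is a rough sketch of one possible approach, not part of the original answer. It assumes plain unquoted labels and reads "joined by a short branch" as: the leaves attached directly to either end of an internal branch shorter than the threshold. The helper names (new_node, parse_newick, leaf_names, short_branch_groups) are made up for this example.

```python
import re

# Split a Newick string into structural characters and "label:length" chunks.
TOKEN = re.compile(r"[(),;]|[^(),;]+")

def new_node(parent=None):
    """Create a node dict and attach it to its parent, if there is one."""
    node = {"name": "", "length": None, "children": [], "parent": parent}
    if parent is not None:
        parent["children"].append(node)
    return node

def parse_newick(text):
    """Return all nodes of the tree, root first (assumes unquoted labels)."""
    node = new_node()
    nodes = [node]
    for token in TOKEN.findall(text.strip()):
        if token == "(":            # open a clade: descend into its first child
            node = new_node(parent=node)
            nodes.append(node)
        elif token == ",":          # next sibling within the same clade
            node = new_node(parent=node["parent"])
            nodes.append(node)
        elif token == ")":          # close the clade: back up to it
            node = node["parent"]
        elif token == ";":
            break
        else:                       # "label:length", either part may be empty
            name, _, length = token.partition(":")
            node["name"] = name.strip()
            node["length"] = float(length) if length.strip() else None
    return nodes

def leaf_names(node):
    """Names of the children of node that are leaves themselves."""
    return [child["name"] for child in node["children"] if not child["children"]]

def short_branch_groups(text, threshold=0.001):
    """One group of leaf IDs per internal branch shorter than threshold."""
    groups = []
    for node in parse_newick(text):
        # Skip leaves, and skip the root, whose length joins nothing.
        is_internal = bool(node["children"]) and node["parent"] is not None
        if is_internal and node["length"] is not None and node["length"] < threshold:
            groups.append(leaf_names(node) + leaf_names(node["parent"]))
    return groups

tree = '(BMNH833953:0.16529463651919140688,(((BMNH833883:0.22945757727367316336,(BMNH724182a:0.18028180766761139897,(BMNH724182b:0.21469677818346077913,BMNH724082:0.54350916483644962085):0.00654573856803835914):0.04530853441176059537):0.02416511342888815264,(((BMNH794142:0.21236619242575086042,(BMNH743008:0.13421900772403019819,BMNH724591:0.14957653992840658219):0.02592135486124686958):0.02477670174791116522,BMNH703458a:0.22983459269245612444):0.00000328449424529074,BMNH703458b:0.29776257618061197086):0.09881729077887969892):0.02257522897558370684,BMNH833928:0.21599133163597591945):0.02365043128986757739,BMNH724053:0.16069861523756587274):0.0;'

print(short_branch_groups(tree))        # [['BMNH703458a', 'BMNH703458b']]
print(short_branch_groups(tree, 0.01))  # two groups, one per short branch
```

With the default threshold this gives the pair named in the question. Raising the threshold to 0.01 also picks up the 0.0065 branch and reports BMNH724182b, BMNH724082 and BMNH724182a as a second group. For anything beyond a quick script, a dedicated Newick parser from a phylogenetics library is probably the safer choice.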
