如果给出一个外群,我如何为一组物种生成所有可能的Newick树排列? [英] How do I generate all possible Newick Tree permutations for a set of species given an outgroup?

查看:157
本文介绍了如果给出一个外群,我如何为一组物种生成所有可能的Newick树排列?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如果给出一个外群,如何为一组物种生成所有可能的Newick树排列?

How do I generate all possible Newick Tree permutations for a set of species given an outgroup?

对于那些不知道Newick树格式是什么的人,a有关详细说明,请访问:
https://en.wikipedia.org/wiki/Newick_format

For those who don't know what Newick tree format is, a good description is available at: https://en.wikipedia.org/wiki/Newick_format

我想在给出外群的情况下为一组物种创建所有可能的Newick Tree排列。我期望处理的叶节点数量很可能是4,5或6个叶节点。

I want to create all possible Newick Tree permutations for a set of species given an outgroup. The number of leaf nodes I expect to process are most likely 4, 5, or 6 leaf nodes.

允许软和硬多边形。
https://en.wikipedia.org/wiki/Polytomy#Soft_polytomies_vs。 _hard_polytomies
https://biology.stackexchange.com/questions / 23667 /证据讨论 - 硬拷贝

Both "Soft" and "hard" polytomies are allowed. https://en.wikipedia.org/wiki/Polytomy#Soft_polytomies_vs._hard_polytomies https://biology.stackexchange.com/questions/23667/evidence-discussions-of-hard-polytomy

下面显示的是理想输出,E设置为outgroup

Shown below is the ideal output, with "E" set as the outgroup

理想输出:

((("A","B","C"),("D"),("E"));
((("A","B","D"),("C"),("E"));
((("A","C","D"),("B"),("E"));
((("B","C","D"),("A"),("E"));
((("A","B")("C","D"),("E"));
((("A","C")("B","D"),("E"));
((("B","C")("A","D"),("E"));
(("A","B","C","D"),("E"));
(((("A","B"),"C"),"D"),("E"));

然而,我使用itertools带来的任何可能的解决方案,特别是itertools.permutations,都遇到过专业版等效输出的瑕疵。我想出的最后一个想法涉及下面显示的等效输出。

However, any possible solutions that I've come with using itertools, specifically itertools.permutations, have come across the problem of equivalent output. The last idea I came up with involved the equivalent output that is shown below.

等价输出:

((("C","B","A"),("D"),("E"));
((("C","A","B"),("D"),("E"));
((("A","C","B"),("D"),("E"));

这是我对解决方案的想法的开始。但是,我不是真的确定现在除了itertools之外还有什么关于这个问题。

Here is the start of my idea for a solution. However, I'm not really sure what to about this problem besides itertools for now.

import itertools

def Newick_Permutation_Generator(list_of_species, name_of_outgroup)
    permutations_list =(list(itertools.permutations(["A","B","C","D","E"])))

    for given_permutation in permutations_list:
        process(given_permutation)

Newick_Permutation_Generator(["A","B","C","D","E"], "E")


推荐答案

树作为递归的叶子集



让我们暂时搁置newick表示,并想一下可能的pyt表示问题。

A tree as a recursive set of sets of leaves

Let's set aside the newick representation for the moment, and think of a possible python representation of the problem.

根树可以被视为递归的集合层次结构(集合... )) 树叶。集合是无序的,非常适合描述树中的分支: {{{A,B},{C,D}},E} 应与 {{{C,D},{B,A}},E} 。

A rooted tree can be viewed as a recursive hierarchy of sets of (sets of (sets of ...)) leaves. Sets are unordered, which is quite adapted to describe clades in a tree: {{{"A", "B"}, {"C", "D"}}, "E"} should be the same thing as {{{"C", "D"}, {"B", "A"}}, "E"}.

如果我们考虑初始叶子集 {A,B,C,D,E } ,以E作为outgroup的树是 {tree,E} 形式的集合集合,其中 tree s取自可以从叶子集构建的树集合 {A,B,C,D } 。我们可以尝试编写一个递归的函数来生成这组树,我们的树集总和将表示如下:

If we consider the initial set of leaves {"A", "B", "C", "D", "E"}, the trees with "E" as outgroup are the set of sets in the form {tree, "E"} where trees are taken from the set of trees that can be built from the set of leaves {"A", "B", "C", "D"}. We could try to write a recursive trees function to generate this set of trees, and our total set of trees would be expressed as follows:

{{tree, "E"} for tree in trees({"A", "B", "C", "D"})}

(这里,我使用 set comprehension 表示法。)

(Here, I use the set comprehension notation.)

实际上,python不允许集合集,因为set必须是hashable(也就是说,python必须能够计算对象的一些哈希值才能检查它们是否属于该集合)。碰巧python集没有这个属性。幸运的是,我们可以使用名为 frozenset <的类似数据结构/ code> ,其行为与集合相似,但无法修改且可以。因此,我们的树组将是:

Actually, python doesn't allow sets of sets, because the elements of a set must be "hashable" (that is, python must be able to compute some "hash" values of objects to be able to check whether they belong or not to the set). It happens that python sets do not have this property. Fortunately, we can use a similar data structure named frozenset, which behaves quite like a set, but cannot be modified and is "hashable". Therefore, our set of trees would be:

all_trees = frozenset(
    {frozenset({tree, "E"}) for tree in trees({"A", "B", "C", "D"})})



实现函数



现在让我们关注 function。

Implementing the trees function

Now let's focus on the trees function.

对于每个可能的分区(分解为一组不相交的子集,包括所有元素)叶子集,我们需要为分区的每个部分找到所有可能的树(通过递归调用)。对于给定的分区,我们将为每个可能的子树组合创建一个树。

For each possible partition (decomposition into a set of disjoint subsets, including all elements) of the set of leaves, we need to find all possible trees (through a recursive call) for each part of the partition. For a given partition, we will then make a tree for each possible combination of subtrees taken across its parts.

例如,如果分区是 {A,{B,C,D}} ,我们将考虑可以从A部分生成的所有树木(实际上,只是叶A本身),以及可以从 {B部分生成的所有可能的树,C,D} (即树({B,C,D})) 。然后,通过获取所有可能的对来获得该分区的可能树,其中一个元素来自A,另一个元素来自树({B,C,D})

For instance, if a partition is {"A", {"B", "C", "D"}}, we will consider all possible trees that can be made from part "A" (actually, just the leaf "A" itself), and all possible trees that can be made from part {"B", "C", "D"} (that is, trees({"B", "C", "D"})). Then, the possible trees for this partition will be obtained by taking all possible pairs where one element comes from just "A", and the other from trees({"B", "C", "D"}).

这可以针对具有两个以上部分的分区进行推广, itertools 中的 product 函数似乎在这里很有用。

This can be generalized for partitions with more than two parts, and the product function from itertools seems to be useful here.

因此,我们需要一种方法来生成一组叶子的可能分区。

Therefore, we need a way to generate the possible partitions of a set of leaves.

在这里,我根据此解决方案改编了 partitions_of_set 功能

Here I made a partitions_of_set function adapted from this solution:

# According to https://stackoverflow.com/a/30134039/1878788:
# The problem is solved recursively:
# If you already have a partition of n-1 elements, how do you use it to partition n elements?
# Either place the n'th element in one of the existing subsets, or add it as a new, singleton subset.
def partitions_of_set(s):
    if len(s) == 1:
        yield frozenset(s)
        return
    # Extract one element from the set
    # https://stackoverflow.com/a/43804050/1878788
    elem, *_ = s
    rest = frozenset(s - {elem})
    for partition in partitions_of_set(rest):
        for subset in partition:
            # Insert the element in the subset
            try:
                augmented_subset = frozenset(subset | frozenset({elem}))
            except TypeError:
                # subset is actually an atomic element
                augmented_subset = frozenset({subset} | frozenset({elem}))
            yield frozenset({augmented_subset}) | (partition - {subset})
        # Case with the element in its own extra subset
        yield frozenset({elem}) | partition

为了检查获得的分区,我们提供了一个功能,使它们更容易显示(也将在以后制作树的newick表示非常有用):

To check the obtained partitions, we make a function to make them easier to display (that will also be useful to make a newick representation of the trees later):

def print_set(f):
    if type(f) not in (set, frozenset):
        return str(f)
    return "(" + ",".join(sorted(map(print_set, f))) + ")"

我们测试分区是否有效:

We test that the partitioning works:

for partition in partitions_of_set({"A", "B", "C", "D"}):
    print(len(partition), print_set(partition))

输出:

1 ((A,B,C,D))
2 ((A,B,D),C)
2 ((A,C),(B,D))
2 ((B,C,D),A)
3 ((B,D),A,C)
2 ((A,B,C),D)
2 ((A,B),(C,D))
3 ((A,B),C,D)
2 ((A,D),(B,C))
2 ((A,C,D),B)
3 ((A,D),B,C)
3 ((A,C),B,D)
3 ((B,C),A,D)
3 ((C,D),A,B)
4 (A,B,C,D)



实际代码函数



现在我们可以编写函数:

from itertools import product
def trees(leaves):
    if type(leaves) not in (set, frozenset):
        # It actually is a single leaf
        yield leaves
        # Don't try to yield any more trees
        return
    # Otherwise, we will have to consider all the possible
    # partitions of the set of leaves, and for each partition,
    # construct the possible trees for each part
    for partition in partitions_of_set(leaves):
        # We need to skip the case where the partition
        # has only one subset (the initial set itself),
        # otherwise we will try to build an infinite
        # succession of nodes with just one subtree
        if len(partition) == 1:
            part, *_ = partition
            # Just to be sure the assumption is correct
            assert part == leaves
            continue
        # We recursively apply *tree* to each part
        # and obtain the possible trees by making
        # the product of the sets of possible subtrees.
        for subtree in product(*map(trees, partition)):
            # Using a frozenset guarantees
            # that there will be no duplicates
            yield frozenset(subtree)

测试它:

all_trees = frozenset(
    {frozenset({tree, "E"}) for tree in trees({"A", "B", "C", "D"})})

for tree in all_trees:
    print(print_set(tree) + ";")

输出:

(((B,C),A,D),E);
((((A,B),D),C),E);
((((B,D),A),C),E);
((((C,D),A),B),E);
(((A,D),B,C),E);
((A,B,C,D),E);
((((B,D),C),A),E);
(((A,B,C),D),E);
((((A,C),B),D),E);
((((C,D),B),A),E);
((((B,C),A),D),E);
(((A,B),C,D),E);
(((A,C),(B,D)),E);
(((B,D),A,C),E);
(((C,D),A,B),E);
((((A,B),C),D),E);
((((A,C),D),B),E);
(((A,C,D),B),E);
(((A,D),(B,C)),E);
((((A,D),C),B),E);
((((B,C),D),A),E);
(((A,B),(C,D)),E);
(((A,B,D),C),E);
((((A,D),B),C),E);
(((A,C),B,D),E);
(((B,C,D),A),E);

我希望结果是正确的。

这种方法对于正确的做法有点棘手。我花了一些时间来弄清楚如何避免无限递归(这种情况发生在分区为 {{A,B,C,D}} )。

This approach was a bit tricky to get right. It took me some time to figure out how to avoid the infinite recursion (This happens when the partition is {{"A", "B", "C", "D"}}).

这篇关于如果给出一个外群,我如何为一组物种生成所有可能的Newick树排列?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆