Algorithm to decide cut-off for collapsing this tree?


Question

I have a Newick tree that is built by comparing the similarity (Euclidean distance) of Position Weight Matrices (PWMs or PSSMs) of putative DNA regulatory motifs, which are 4-9 bp long DNA sequences.

An interactive version of the tree is up on iTol (here), which you can freely play with; just press "update tree" after setting your parameters.

My specific goal: to collapse the motifs (tips/terminal nodes/leaves) together if their average distance to the nearest parent clade is < X (with the ETE2 Python package). This is biologically interesting since some of the gene regulatory DNA motifs may be homologous (paralogues or orthologues) with one another. This collapsing can be done via the iTol GUI linked above, e.g. if you choose X = 0.001 then some motifs become collapsed into triangles (motif families).
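To make the collapsing rule concrete, here is a minimal pure-Python sketch of one way to read it: a clade is collapsed when the mean branch-length distance from its root to its leaves is below X. The `Node` class, the function names and the toy tree are all hypothetical and only illustrate the idea; this is not ETE2's API and not necessarily what iTol does internally.

```python
class Node:
    """Hypothetical minimal tree node (not the ETE2 class)."""
    def __init__(self, name=None, dist=0.0, children=None):
        self.name = name                # leaf label, None for internal nodes
        self.dist = dist                # branch length to the parent
        self.children = children or []

    def is_leaf(self):
        return not self.children


def leaf_distances(node):
    """Return (leaf name, distance from `node`) for every leaf below `node`."""
    if node.is_leaf():
        return [(node.name, 0.0)]
    pairs = []
    for child in node.children:
        for name, d in leaf_distances(child):
            pairs.append((name, d + child.dist))
    return pairs


def collapse(node, x):
    """Group leaf names: a clade becomes one group when the mean distance
    from its root to its leaves is below x (assumed reading of the rule)."""
    pairs = leaf_distances(node)
    mean = sum(d for _, d in pairs) / len(pairs)
    if node.is_leaf() or mean < x:
        return [sorted(name for name, _ in pairs)]
    groups = []
    for child in node.children:
        groups.extend(collapse(child, x))
    return groups


# Toy tree, equivalent to ((A:0.0005,B:0.0005):0.3,(C:0.2,D:0.4):0.1);
tree = Node(children=[
    Node(dist=0.3, children=[Node("A", 0.0005), Node("B", 0.0005)]),
    Node(dist=0.1, children=[Node("C", 0.2), Node("D", 0.4)]),
])

for x in (0.001, 0.31, 1.0):
    print(x, collapse(tree, x))
```

The traversal is top-down, so the largest clade that satisfies the threshold is the one that gets collapsed; a bottom-up variant would be an equally defensible choice.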

My question: Could anybody suggest an algorithm that would either output, or help visualise, which value of X is appropriate for "maximizing the biological or statistical relevance" of the collapsed motifs? Ideally, some property of the tree would show an obvious step change when plotted against X, suggesting a sensible X to the algorithm. Are there any known algorithms/scripts/packages for this? Perhaps the code would plot some statistic against the value of X? I've tried plotting X vs. mean cluster size (matplotlib), but I don't see an obvious "step increase" to tell me which value of X to use.
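One generic heuristic for finding such a step change (a common trick with dendrograms, not something taken from the linked script) is to look for the widest gap between consecutive clade heights and cut in the middle of it. The sketch below assumes a binary tree whose clade heights increase towards the root, so that each clade with height below X removes exactly one cluster; the function names and the `heights` values are made up for illustration.

```python
def clusters_at(heights, n_leaves, x):
    """Number of clusters left when every clade with height < x is collapsed.
    Assumes a binary tree whose clade heights grow towards the root."""
    return n_leaves - sum(1 for h in heights if h < x)


def largest_gap_cutoff(heights):
    """Return the midpoint of the widest gap between consecutive sorted heights."""
    hs = sorted(heights)
    gaps = [(b - a, a, b) for a, b in zip(hs, hs[1:])]
    width, low, high = max(gaps)
    return (low + high) / 2.0


# Hypothetical clade heights for a 6-leaf binary tree (5 internal nodes).
heights = [0.001, 0.002, 0.003, 0.2, 0.25]
cutoff = largest_gap_cutoff(heights)

# Cluster count just below, at, and above the suggested cut-off.
for x in (0.0025, cutoff, 0.3):
    print(x, clusters_at(heights, 6, x))
```

Any X inside the widest gap gives the same clustering, which is why that range is a natural candidate; whether it is also the biologically relevant one still has to be checked against the motifs themselves.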

My code and data: A link to my Python script is [here][8]. I have heavily commented it, and it will generate the tree data and the plot above for you (use the arguments d_from, d_to and d_step to explore the distance cut-offs, X). You will need to install ete2; if you have easy_install and Python, simply execute these two bash commands:

apt-get install python-setuptools python-numpy python-qt4 python-scipy python-mysqldb python-lxml

easy_install -U ete2

Recommended answer

You could try something similar to tree reconciliation, as @Jeff mentioned, but standard tree reconciliation will actually fail here.

Reconciliation involves first adding branches that represent "losses" of evolutionary characters throughout the target tree, and then marking the nodes at which "duplications" of evolutionary characters have occurred. The weighted sum of losses and duplications provides a cost function to optimise.

But in your case, the problem you want to solve is "break this super-tree into appropriately sized, orthologous sub-trees". This means you don't really want to score losses as much as you would duplications. You want a way to score the tree such that it reveals how many orthologous sub-trees are merged into your super-tree. Thus you can try this scoring approach:

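  1. Take your super-tree and count the number of duplicated species, S1.
  2. Collapse all terminal leaves that are paralogues, and count the number of duplicated species again, S2.
  3. The difference between S1 and S2 reveals approximately how many sub-trees are in your super-tree.
  4. Correct for any bias caused by super-trees of various sizes by dividing by the number of unique species represented in the super-tree, N.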

If we call this score the "sub-tree factor" then it equates to:

(S1 - S2) / N
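As a rough sketch of how this score could be computed, reading it as (S1 - S2) / N: the code below represents a tree as nested tuples whose leaves are species names. Two things here are assumptions on my part rather than part of the answer: "duplicated species" is taken to mean species that occur on more than one leaf, and "collapsing paralogues" is taken to mean merging any clade whose leaves all belong to the same species. All function names are hypothetical.

```python
from collections import Counter


def leaves(tree):
    """Flatten a nested-tuple tree into its list of leaf species labels."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def collapse_paralogues(tree):
    """Replace every clade whose leaves all belong to one species by one leaf."""
    if isinstance(tree, str):
        return tree
    children = tuple(collapse_paralogues(c) for c in tree)
    if all(isinstance(c, str) for c in children) and len(set(children)) == 1:
        return children[0]
    return children


def duplicated_species(tree):
    """Count species found on more than one leaf (assumed meaning of S1/S2)."""
    return sum(1 for n in Counter(leaves(tree)).values() if n > 1)


def subtree_factor(tree):
    """Return (S1, S2, (S1 - S2) / N) for a nested-tuple tree."""
    s1 = duplicated_species(tree)
    s2 = duplicated_species(collapse_paralogues(tree))
    n = len(set(leaves(tree)))          # N: unique species in the super-tree
    return s1, s2, (s1 - s2) / float(n)


# Duplicates that sit next to each other (recent paralogues).
recent = ((("human", "human"), "mouse"), ("fly", "fly"))
# The same species repeated in two separate clades (two merged sub-trees).
ancient = (("human", "mouse"), ("human", "mouse"))

print(subtree_factor(recent))
print(subtree_factor(ancient))
```

In the `recent` tree every duplicate disappears on collapsing, so S2 is 0; in the `ancient` tree nothing can be collapsed, so S2 equals S1 and the score is 0.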

Inferences:

  • If S1 - S2 = S1 (that is, S2 = 0), your super-tree contains approximately one true sub-tree, and all repeated species occurrences were just due to recent paralogues.

  • If S1 - S2 = 0, your super-tree contains approximately S1 true sub-trees.
