Phylogenetic tree

Question

I am trying to build a phylogenetic tree based on pairwise gene data. Below is a subset of my data (test.txt). The tree does not have to be built from DNA sequences; the gene names can simply be treated as words.

ID  gene1   gene2

1   ADRA1D  ADK
2   ADRA1B  ADK
3   ADRA1A  ADK
4   ADRB1   ASIC1
5   ADRB1   ADK
6   ADRB2   ASIC1
7   ADRB2   ADK
8   AGTR1   ACHE
9   AGTR1   ADK
10  ALOX5   ADRB1
11  ALOX5   ADRB2
12  ALPPL2  ADRB1
13  ALPPL2  ADRB2
14  AMY2A   AGTR1
15  AR  ADORA1
16  AR  ADRA1D
17  AR  ADRA1B
18  AR  ADRA1A
19  AR  ADRA2A
20  AR  ADRA2B

Below is my code in R:

library(ape)
tab=read.csv("test.txt",sep="\t",header=TRUE)
d=dist(tab,method="euclidean")
fit <- hclust(d, method="ward")
plot(as.phylo(fit))

My figure was attached here.

I have a question about how they are clustered. The pairs

 17 AR  ADRA1B
 18 AR  ADRA1A

 2  ADRA1B  ADK
 3  ADRA1A  ADK

should cluster closely because they have one gene in common. So 17 and 2 should be together, and likewise 18 and 3.

If Euclidean distance is the wrong choice here, should I use another method?

Should I convert my data to a matrix of rows and columns, where gene1 is on the x-axis and gene2 is on the y-axis, with each cell filled with 1 or 0? (Basically, 1 if the two genes are paired and 0 if not.)

Updated code:

   table=table(tab$gene1, tab$gene2)
   d <- dist(table,method="euclidean")
   fit <- hclust(d, method="ward")
   plot(as.phylo(fit))

However, this gives me only the genes from the gene1 column and not those from the gene2 column. The figure below is exactly what I want, except that it should also include the genes from the gene2 column.

Recommended answer

There is some room for interpretation in the example in the question. My answer is only valid if each row describes an individual and each individual really carries exactly two genes. If, however, each row means that gene1 occurs together with gene2 with certainty, then in my opinion no useful clustering can be performed. In that case I would expect an additional column giving the probability of their co-occurrence, and something like a principal component analysis (PCA) might be preferable, but I am far from being an expert on (hierarchical) clustering.

Before you can use the dist function, you have to bring your data into an appropriate format:

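# convert test data into a presence/absence matrix (one row per ID, one column per gene)
# as.character() guards against the gene columns having been read in as factors
g1 <- as.character(tab[,"gene1"])
g2 <- as.character(tab[,"gene2"])
gene.names <- sort(unique(c(g1,g2)))
gene.matrix <- cbind(tab[,"ID"],matrix(0L,nrow=nrow(tab),ncol=length(gene.names)))
colnames(gene.matrix) <- c("ID",gene.names)
rows <- seq_len(nrow(tab))
# mark both genes of every row with 1, using (row, column) index pairs
gene.matrix[cbind(rows,match(g1,colnames(gene.matrix)))] <- 1L
gene.matrix[cbind(rows,match(g2,colnames(gene.matrix)))] <- 1L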

The resulting gene.matrix has the following shape:

     ID ACHE ADK ADORA1 ADRA1A ADRA1B ADRA1D ADRA2A ...
[1,]  1    0   1      0      0      0      1      0
[2,]  2    0   1      0      0      1      0      0
[3,]  3    0   1      0      1      0      0      0
[4,]  4    0   0      0      0      0      0      0
...

So each row represents an observation (= individual): the first column identifies the individual, and each subsequent column contains 1 if the gene is present and 0 if it is missing. The dist function can reasonably be applied to this matrix (with the ID column removed):

# "ward" was renamed to "ward.D" in R 3.1.0; as.phylo() comes from the ape package
d <- dist(gene.matrix[,-1],method="euclidean")
fit <- hclust(d, method="ward.D")
plot(as.phylo(fit))
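
As a quick sanity check on this approach, the sketch below rebuilds the presence/absence matrix from the 20 rows in the question and compares two distances. It uses base R only; the variable names (m, genes, d) are illustrative and not part of the original answer, and the data frame is typed in by hand rather than read from test.txt.

```r
# The 20 rows from the question, entered by hand (assumption: test.txt holds exactly these)
tab <- data.frame(
  ID = 1:20,
  gene1 = c("ADRA1D", "ADRA1B", "ADRA1A", "ADRB1", "ADRB1", "ADRB2", "ADRB2",
            "AGTR1", "AGTR1", "ALOX5", "ALOX5", "ALPPL2", "ALPPL2", "AMY2A",
            "AR", "AR", "AR", "AR", "AR", "AR"),
  gene2 = c("ADK", "ADK", "ADK", "ASIC1", "ADK", "ASIC1", "ADK",
            "ACHE", "ADK", "ADRB1", "ADRB2", "ADRB1", "ADRB2", "AGTR1",
            "ADORA1", "ADRA1D", "ADRA1B", "ADRA1A", "ADRA2A", "ADRA2B"),
  stringsAsFactors = FALSE
)

# Presence/absence matrix: one row per ID, one column per gene
genes <- sort(unique(c(tab$gene1, tab$gene2)))
m <- matrix(0L, nrow = nrow(tab), ncol = length(genes),
            dimnames = list(as.character(tab$ID), genes))
rows <- seq_len(nrow(tab))
m[cbind(rows, match(tab$gene1, genes))] <- 1L
m[cbind(rows, match(tab$gene2, genes))] <- 1L

# Full distance matrix, indexed by ID
d <- as.matrix(dist(m, method = "euclidean"))

d["17", "2"]  # rows 17 and 2 share ADRA1B: they differ in 2 columns, so sqrt(2)
d["17", "4"]  # rows 17 and 4 share no gene: they differ in 4 columns, so 2
```

Rows that share a gene are therefore closer than rows that share none, which is the behaviour the question asks for. Setting the row names to the IDs also means the tips of the resulting tree are labelled by ID.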

It may be a good idea to read up on the differences between the distance measures (euclidean, manhattan, etc.). For instance, the Euclidean distance between the individuals with ID=1 and ID=2 is:

euclidean_dist = sqrt((0-0)^2 + (1-1)^2 + (0-0)^2 + (0-0)^2 + (0-1)^2 + ...)

and the Manhattan distance is:

manhattan_dist = abs(0-0) + abs(1-1) + abs(0-0) + abs(0-0) + abs(0-1) + ...
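
The two formulas can be evaluated directly in base R. The sketch below uses only the first six gene columns shown in the matrix excerpt (ACHE to ADRA1D); every other column is 0 for both individuals and so contributes nothing. The names id1 and id2 are illustrative.

```r
# Presence/absence vectors for ID=1 (ADRA1D, ADK) and ID=2 (ADRA1B, ADK)
# Columns: ACHE, ADK, ADORA1, ADRA1A, ADRA1B, ADRA1D
id1 <- c(0, 1, 0, 0, 0, 1)
id2 <- c(0, 1, 0, 0, 1, 0)

# By hand, following the formulas above
euclidean_dist <- sqrt(sum((id1 - id2)^2))  # sqrt(2), about 1.414214
manhattan_dist <- sum(abs(id1 - id2))       # 2

# The same values via dist()
as.numeric(dist(rbind(id1, id2), method = "euclidean"))
as.numeric(dist(rbind(id1, id2), method = "manhattan"))
```

For 0/1 data the Manhattan distance simply counts the columns in which two rows differ, and the Euclidean distance is the square root of that count.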
