Phylogenetic tree
Question
I am trying to build a phylogenetic tree from pairwise gene data. Below is a subset of my data (test.txt). The tree does not have to be constructed from DNA sequences; the gene names can simply be treated as words.
ID gene1 gene2
1 ADRA1D ADK
2 ADRA1B ADK
3 ADRA1A ADK
4 ADRB1 ASIC1
5 ADRB1 ADK
6 ADRB2 ASIC1
7 ADRB2 ADK
8 AGTR1 ACHE
9 AGTR1 ADK
10 ALOX5 ADRB1
11 ALOX5 ADRB2
12 ALPPL2 ADRB1
13 ALPPL2 ADRB2
14 AMY2A AGTR1
15 AR ADORA1
16 AR ADRA1D
17 AR ADRA1B
18 AR ADRA1A
19 AR ADRA2A
20 AR ADRA2B
Below is my code in R:
library(ape)
tab=read.csv("test.txt",sep="\t",header=TRUE)
d=dist(tab,method="euclidean")
fit <- hclust(d, method="ward")
plot(as.phylo(fit))
My figure is attached here.
I have a question about how the rows are clustered. The pairs
17 AR ADRA1B
18 AR ADRA1A
and
2 ADRA1B ADK
3 ADRA1A ADK
should be clustered closely because they have one gene in common. So 17 and 2 should be together, and likewise 18 and 3.
If Euclidean distance is the wrong choice here, should I use another method?
Should I convert my data to a matrix in which gene1 is the x-axis, gene2 is the y-axis, and each cell is filled with 1 or 0? (Basically, 1 if the two genes are paired and 0 if they are not.)
Updated code:
table=table(tab$gene1, tab$gene2)
d <- dist(table,method="euclidean")
fit <- hclust(d, method="ward")
plot(as.phylo(fit))
However, this gives me only the genes from the gene1 column, not those from the gene2 column. The figure below is exactly what I want, but it should include the genes from the gene2 column as well.
Recommended answer
There is some room for interpretation in the example given in the question. My answer is only valid if there are really only two genes present in each individual and each row describes one individual. If, however, each row means that gene1 occurs with gene2 with certainty, then no useful clustering can be performed, in my opinion. In that case I would expect an additional column stating the probability of their common occurrence, and something like a principal component analysis (PCA) may be preferable, but I am far from being an expert on (hierarchical) clustering.
Before you can use the dist function, you have to bring your data into an appropriate format:
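# convert test data into suitable format (as.character guards against factor columns)
g1 <- as.character(tab[,"gene1"])
g2 <- as.character(tab[,"gene2"])
gene.names <- sort(unique(c(g1,g2)))
gene.matrix <- cbind(tab[,"ID"],matrix(0L,nrow=nrow(tab),ncol=length(gene.names)))
colnames(gene.matrix) <- c("ID",gene.names)
# mark both genes of each row with 1, using (row, column) index pairs
rows <- seq_len(nrow(tab))
gene.matrix[cbind(rows,match(g1,colnames(gene.matrix)))] <- 1L
gene.matrix[cbind(rows,match(g2,colnames(gene.matrix)))] <- 1L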
The resulting gene.matrix has the following shape:
ID ACHE ADK ADORA1 ADRA1A ADRA1B ADRA1D ADRA2A ...
[1,] 1 0 1 0 0 0 1 0
[2,] 2 0 1 0 0 1 0 0
[3,] 3 0 1 0 1 0 0 0
[4,] 4 0 0 0 0 0 0 0
...
So each row represents one observation (= individual): the first column identifies the individual, and each subsequent column contains 1 if the gene is present and 0 if it is missing. The dist function can reasonably be applied to this matrix (with the ID column removed):
d <- dist(gene.matrix[,-1],method="euclidean")
fit <- hclust(d, method="ward.D")  # "ward" was renamed to "ward.D" in R 3.1.0
plot(as.phylo(fit))
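As a quick check of what this encoding does to the rows the question singles out, here is a minimal sketch. It assumes a five-row subset of test.txt typed inline, and the variable name m is introduced only for this illustration. Rows that share one gene end up at distance sqrt(2), and rows that share none end up at distance 2.

```r
# Five rows from the question's test.txt, typed inline (illustrative subset)
tab <- data.frame(
  ID    = c(2, 3, 4, 17, 18),
  gene1 = c("ADRA1B", "ADRA1A", "ADRB1", "AR", "AR"),
  gene2 = c("ADK", "ADK", "ASIC1", "ADRA1B", "ADRA1A"),
  stringsAsFactors = FALSE
)

gene.names <- sort(unique(c(tab$gene1, tab$gene2)))

# presence/absence matrix: one row per ID, one 0/1 column per gene
m <- matrix(0L, nrow = nrow(tab), ncol = length(gene.names),
            dimnames = list(as.character(tab$ID), gene.names))
rows <- seq_len(nrow(tab))
m[cbind(rows, match(tab$gene1, gene.names))] <- 1L
m[cbind(rows, match(tab$gene2, gene.names))] <- 1L

d <- as.matrix(dist(m, method = "euclidean"))
d["2", "17"]  # rows 2 and 17 share ADRA1B
d["2", "3"]   # rows 2 and 3 share ADK
d["2", "4"]   # rows 2 and 4 share no gene
```

With exactly two genes per row, any two rows that share one gene are equally far apart, so row 2 is as close to row 3 (shared ADK) as it is to row 17 (shared ADRA1B). Which of those ties the dendrogram joins first depends on the linkage method.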
It may be a good idea to read up on the differences between distance measures such as euclidean and manhattan. For instance, the Euclidean distance between the individuals with ID=1 and ID=2 is:
euclidean_dist = sqrt((0-0)^2 + (1-1)^2 + (0-0)^2 + (0-0)^2 + (0-1)^2 + ...)
and the Manhattan distance is:
manhattan_dist = abs(0-0) + abs(1-1) + abs(0-0) + abs(0-0) + abs(0-1) + ...
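The two formulas can be evaluated directly for ID=1 (ADRA1D, ADK) and ID=2 (ADRA1B, ADK). This is a small sketch in which the matrix x and its row labels id1 and id2 are made up for the illustration. Only the three genes that occur in either row are kept, since columns that are 0 in both rows add nothing to either sum.

```r
# presence/absence rows for ID=1 and ID=2, restricted to the genes in either row
x <- rbind(id1 = c(ADK = 1, ADRA1B = 0, ADRA1D = 1),
           id2 = c(ADK = 1, ADRA1B = 1, ADRA1D = 0))

# the formulas written out by hand
euclidean_dist <- sqrt(sum((x["id1", ] - x["id2", ])^2))
manhattan_dist <- sum(abs(x["id1", ] - x["id2", ]))

# the same values computed by dist()
dist(x, method = "euclidean")
dist(x, method = "manhattan")
```

The two rows differ in exactly two columns (ADRA1B and ADRA1D), so the Euclidean distance is sqrt(2) and the Manhattan distance is 2.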