How to convert a data.frame to a tree structure object such as a dendrogram

Problem description

I have a data.frame object. A simple example:

    > data.frame(x=c('A','A','B','B','B'), y=c('Ab','Ac','Ba','Ba','Bd'), z=c('Abb','Acc','Bad','Bae','Bdd'))
      x  y   z
    1 A Ab Abb
    2 A Ac Acc
    3 B Ba Bad
    4 B Ba Bae
    5 B Bd Bdd

There are many more rows and columns in the actual data. How can I create a nested tree structure object, such as a dendrogram, like this:

         |---Ab---Abb
     A---|
     |   |---Ac---Acc
   --|                 /--Bad 
     |   |---Ba-------|
     B---|             \--Bae
         |---Bd---Bdd

Solution

data.frame to Newick

I did my PhD in computational phylogenetics, and somewhere along the way I wrote this code, which I used once or twice when I got data in this format (nonstandard in the phylogenetic sense). The script traverses the data frame as if it were a tree and pastes pieces along the way into a Newick string, a standard format that can then be converted into any kind of tree object.

I guess the script could be optimized (I used it so rarely that more work on it would reduce the overall efficiency), but it is better to share it than to let it collect dust on my hard drive.

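    ## Recursive helper: returns the Newick fragment for value `a` in column `i`.
    ## Note: `df` is not an argument; it is resolved through the function's
    ## enclosing environment (see df2newick below).
    traverse <- function(a, i, innerl) {
        if (i < ncol(df)) {
            ## distinct children of `a` in the next column
            alevelinner <- as.character(unique(df[which(as.character(df[, i]) == a), i + 1]))
            desc <- NULL
            if (length(alevelinner) == 1) {
                ## single child: skip this inner node and descend directly
                newickout <- traverse(alevelinner, i + 1, innerl)
            } else {
                for (b in alevelinner) desc <- c(desc, traverse(b, i + 1, innerl))
                il <- NULL
                if (innerl == TRUE) il <- a
                newickout <- paste("(", paste(desc, collapse = ","), ")", il, sep = "")
            }
        } else {
            ## last column: the value is a leaf label
            newickout <- a
        }
        newickout
    }

    ## data.frame to Newick function
    df2newick <- function(df, innerlabel = FALSE) {
        ## Rebind a local copy of traverse to this call's environment so that it
        ## sees the `df` argument rather than a global variable named `df`.
        environment(traverse) <- environment()
        alevel <- as.character(unique(df[, 1]))
        newick <- NULL
        for (x in alevel) newick <- c(newick, traverse(x, 1, innerlabel))
        paste("(", paste(newick, collapse = ","), ");", sep = "")
    }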
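To make the pasting step concrete, here is a minimal sketch in base R of what the recursion assembles for a single inner node. The variable names (`children`, `unlabelled`, `labelled`) are only illustrative and do not appear in the script above.

```r
# Children of one inner node, already converted to Newick fragments
children <- c("Bad", "Bae")

# Without an inner label: "(child1,child2)"
unlabelled <- paste("(", paste(children, collapse = ","), ")", sep = "")

# With an inner label appended right after the closing parenthesis
labelled <- paste("(", paste(children, collapse = ","), ")", "Ba", sep = "")

print(unlabelled)  # "(Bad,Bae)"
print(labelled)    # "(Bad,Bae)Ba"
```

The full string is built by nesting such fragments and terminating the outermost one with a semicolon.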

The main function df2newick() takes two arguments:

  • df: the data frame to be transformed (an object of class data.frame)
  • innerlabel: a logical (boolean) flag that tells the function to write labels for inner nodes

To demonstrate it on your example:

    df <- data.frame(x=c('A','A','B','B','B'), y=c('Ab','Ac','Ba', 'Ba','Bd'), z=c('Abb','Acc','Bad', 'Bae','Bdd'))
    myNewick <- df2newick(df)
    #[1] "((Abb,Acc),((Bad,Bae),Bdd));"

Now you can read it into an object of class phylo with read.tree() from ape:

    library(ape)
    mytree <- read.tree(text=myNewick)
    plot(mytree)
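As a quick sanity check, the parsed tree can be inspected before plotting. This is only a sketch: it assumes the ape package is installed (the code skips itself otherwise) and hard-codes the Newick string from the example rather than calling df2newick().

```r
# Newick string copied from the example output above
newick <- "((Abb,Acc),((Bad,Bae),Bdd));"

if (requireNamespace("ape", quietly = TRUE)) {
  tree <- ape::read.tree(text = newick)
  print(ape::Ntip(tree))   # number of tips (leaves)
  print(ape::Nnode(tree))  # number of internal nodes
  print(tree$tip.label)    # leaf labels as parsed from the string
} else {
  message("Package 'ape' is not installed; skipping the check.")
}
```

The tip count should equal the number of distinct values in the last column of the data frame.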

If you want to add inner node labels to the Newick string, you can use this:

    myNewick <- df2newick(df, TRUE)
    #[1] "((Abb,Acc)A,((Bad,Bae)Ba,Bdd)B);"
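If ape is available, the inner labels can also be shown on the plot. The sketch below again hard-codes the labelled string from the output above and assumes ape is installed; `show.node.label` is an argument of ape's plot method for phylo objects.

```r
# Labelled Newick string copied from the example output above
labelled_newick <- "((Abb,Acc)A,((Bad,Bae)Ba,Bdd)B);"

if (requireNamespace("ape", quietly = TRUE)) {
  labelled_tree <- ape::read.tree(text = labelled_newick)
  print(labelled_tree$node.label)              # inner labels as stored by ape
  plot(labelled_tree, show.node.label = TRUE)  # draw them next to the inner nodes
} else {
  message("Package 'ape' is not installed; skipping the plot.")
}
```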

Hope this is useful (and maybe my PhD wasn't a complete waste of time ;-)


Additional note for your dataframe format:

As you can see, the df2newick function ignores inner nodes with only one child (which is the best choice for most phylogenetic methods anyway, and was all that mattered to me). The df objects that I originally got and used with this script had this format:

    df <- data.frame(x=c('A','A','B','B','B'), y=c('Abb','Acc','Ba', 'Ba','Bdd'), z=c('Abb','Acc','Bad', 'Bae','Bdd'))

This is very similar to yours, but there the inner single-child nodes simply had the same name as their children. In your data these nodes have their own names, and those names get ignored. That might not be relevant, but if it is, you can disable part of the recursion function, like this:

    ## Variant of traverse() that keeps single-child inner nodes
    traverse <- function(a, i, innerl) {
        if (i < ncol(df)) {
            alevelinner <- as.character(unique(df[which(as.character(df[, i]) == a), i + 1]))
            desc <- NULL
            ## single-child shortcut disabled:
            ## if (length(alevelinner) == 1) (newickout <- traverse(alevelinner, i + 1, innerl))
            ## else {
            for (b in alevelinner) desc <- c(desc, traverse(b, i + 1, innerl))
            il <- NULL
            if (innerl == TRUE) il <- a
            newickout <- paste("(", paste(desc, collapse = ","), ")", il, sep = "")
            ## }
        } else {
            newickout <- a
        }
        newickout
    }

and with df2newick(df, TRUE) on your data you would get something like this:

    [1] "(((Abb)Ab,(Acc)Ac)A,((Bad,Bae)Ba,(Bdd)Bd)B);"

This looks really odd to me, but I am adding it just in case, because it now includes all the information from your original data frame.
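For comparison, here is a self-contained sketch of the same idea that passes the data frame explicitly and recurses on row subsets. It is not the code from the answer: the names `df_to_newick`, `build`, and `collapse_single` are hypothetical. One difference is that the script above selects children by label in a single column, whereas this version only looks at rows below the current node, so identical labels under different parents stay separate.

```r
# Hypothetical alternative to df2newick(); base R only.
df_to_newick <- function(df, innerlabel = FALSE, collapse_single = TRUE) {
  build <- function(sub, label) {
    # `sub` holds the remaining columns for the rows below node `label`
    if (ncol(sub) == 0) return(label)          # no columns left: leaf
    first <- as.character(sub[[1]])
    kids <- unique(first)
    parts <- vapply(kids, function(k) {
      build(sub[first == k, -1, drop = FALSE], k)
    }, character(1))
    # optionally skip inner nodes that have a single child
    if (collapse_single && length(parts) == 1) return(unname(parts))
    paste0("(", paste(parts, collapse = ","), ")", if (innerlabel) label else "")
  }
  # the root gets an empty label; Newick strings end with a semicolon
  paste0(build(df, ""), ";")
}

df <- data.frame(x = c("A", "A", "B", "B", "B"),
                 y = c("Ab", "Ac", "Ba", "Ba", "Bd"),
                 z = c("Abb", "Acc", "Bad", "Bae", "Bdd"))

print(df_to_newick(df))
# "((Abb,Acc),((Bad,Bae),Bdd));"
print(df_to_newick(df, innerlabel = TRUE))
# "((Abb,Acc)A,((Bad,Bae)Ba,Bdd)B);"
print(df_to_newick(df, innerlabel = TRUE, collapse_single = FALSE))
# "(((Abb)Ab,(Acc)Ac)A,((Bad,Bae)Ba,(Bdd)Bd)B);"
```

Setting `collapse_single = FALSE` corresponds to the variant with the single-child shortcut disabled.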
