Differences in heatmap/clustering defaults in R (heatplot versus heatmap.2)?


Problem description

I'm comparing two ways of creating heatmaps with dendrograms in R, one with made4's heatplot and one with gplots's heatmap.2. The appropriate results depend on the analysis, but I'm trying to understand why the defaults are so different, and how to get both functions to give the same result (or a highly similar one) so that I understand all the 'blackbox' parameters that go into this.

This is the example data and packages:

require(gplots)
# made4 from bioconductor
require(made4)
data(khan)
data <- as.matrix(khan$train[1:30,])

Clustering the data with heatmap.2 gives:

heatmap.2(data, trace="none")

Using heatplot gives:

heatplot(data)

The results and scalings are very different initially. The heatplot results look more reasonable in this case, so I'd like to understand what parameters to feed into heatmap.2 to get it to do the same, both because heatmap.2 has other advantages/features I'd like to use and because I want to understand the missing ingredients.

heatplot uses average linkage with correlation distance, so we can feed that into heatmap.2 to ensure similar clusterings are used (based on: https://stat.ethz.ch/pipermail/bioconductor/2010-August/034757.html):

dist.pear <- function(x) as.dist(1-cor(t(x)))
hclust.ave <- function(x) hclust(x, method="average")
heatmap.2(data, trace="none", distfun=dist.pear, hclustfun=hclust.ave)

This makes the row-side dendrograms look more similar, but the columns are still different and so are the scales. It appears that heatplot scales the columns somehow by default, whereas heatmap.2 does not. If I add row scaling to heatmap.2, I get:

heatmap.2(data, trace="none", distfun=dist.pear, hclustfun=hclust.ave,scale="row")

which still isn't identical but is closer. How can I reproduce heatplot's results with heatmap.2? What are the differences?

Edit 2: It seems that a key difference is that heatplot rescales the data with both rows and columns, using:

if (dualScale) {
    print(paste("Data (original) range: ", round(range(data), 
        2)[1], round(range(data), 2)[2]), sep = "")
    data <- t(scale(t(data)))
    print(paste("Data (scale) range: ", round(range(data), 
        2)[1], round(range(data), 2)[2]), sep = "")
    data <- pmin(pmax(data, zlim[1]), zlim[2])
    print(paste("Data scaled to range: ", round(range(data), 
        2)[1], round(range(data), 2)[2]), sep = "")
}

This is what I'm trying to bring into my call to heatmap.2. I like it because it increases the contrast between the low and high values, whereas zlim passed to heatmap.2 is simply ignored. How can I use this 'dual scaling' while preserving the clustering along the columns? All I want is the increased contrast you get with:

heatplot(..., dualScale=TRUE, scale="none")

compared with the low contrast you get with:

heatplot(..., dualScale=FALSE, scale="row")

Any ideas on this?

Solution

The main differences between heatmap.2 and heatplot functions are the following:

  1. By default, heatmap.2 uses Euclidean distance to build the distance matrix and complete linkage for clustering, while heatplot uses correlation distance and average linkage.

  2. heatmap.2 computes the distance matrix and runs the clustering algorithm before scaling, whereas heatplot (when dualScale=TRUE) clusters data that has already been scaled.

  3. heatmap.2 reorders the dendrogram based on the row and column mean values, as described here.

The default settings (point 1) can be changed within heatmap.2 simply by supplying custom distfun and hclustfun arguments. Points 2 and 3, however, cannot easily be addressed without changing the source code. The heatplot function therefore acts as a wrapper for heatmap.2: it first applies the necessary transformation to the data, calculates the distance matrix and clusters the data, and then uses heatmap.2 only to plot the heatmap with the above parameters.
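The three points can be checked on a small simulated matrix using only base R. This is a rough sketch rather than the internals of either function: the matrix `x` and every variable name are invented for illustration, and the code imitates the defaults described above instead of calling heatmap.2 or heatplot.

```r
# Simulated stand-in for an expression matrix (rows = genes, columns = samples).
# All names here are illustrative; nothing below calls gplots or made4.
set.seed(1)
x <- matrix(rnorm(20 * 6), nrow = 20,
            dimnames = list(paste0("g", 1:20), paste0("s", 1:6)))
x <- x + 1:20                      # give each row its own baseline level

z <- t(scale(t(x)))                # row-wise z-scores

# Point 1: the two sets of defaults generally build different trees.
hc_euclid <- hclust(dist(x), method = "complete")               # heatmap.2-style
hc_cor    <- hclust(as.dist(1 - cor(t(x))), method = "average") # heatplot-style

# Point 2: Pearson correlation between rows ignores row centering/scaling,
# but correlation between columns does not.
row_same <- all.equal(cor(t(x)), cor(t(z)))
col_same <- all.equal(cor(x), cor(z))

# Point 3: reordering by row means (the weights heatmap.2 uses when Rowv = TRUE)
# can flip branches but keeps every merge height.
dend      <- as.dendrogram(hc_cor)
dend_mean <- reorder(dend, rowMeans(x))
ids <- rownames(x)
same_tree <- all.equal(as.matrix(cophenetic(dend))[ids, ids],
                       as.matrix(cophenetic(dend_mean))[ids, ids])

isTRUE(row_same)   # row-row correlations are unchanged by row scaling
isTRUE(col_same)   # column-column correlations change
isTRUE(same_tree)  # same topology, possibly different leaf order
```

If this reading is right, it would also explain why supplying correlation distance and average linkage to heatmap.2 brings the row dendrograms close together while the column dendrograms stay different: with correlation distance, only the column clustering is sensitive to whether row scaling happens before or after the distances are computed.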

The dualScale=TRUE argument in the heatplot function applies only row-based centering and scaling. It then reassigns the extreme values of the scaled data to the zlim values:

z <- t(scale(t(data)))
zlim <- c(-3,3)
z <- pmin(pmax(z, zlim[1]), zlim[2])
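To see what the clamping step does, here is a small sketch on simulated data with one planted outlier. The matrix `m` and the outlier value are assumptions made purely for illustration.

```r
set.seed(42)
m <- matrix(rnorm(10 * 30), nrow = 10)   # 10 rows, 30 columns
m[1, 1] <- 50                            # plant one extreme value

z <- t(scale(t(m)))                      # centre and scale each row
zlim <- c(-3, 3)
z_clamped <- pmin(pmax(z, zlim[1]), zlim[2])

range(z)          # the planted outlier lies above the upper limit
range(z_clamped)  # every value is now confined to zlim
```

Because the colour scale then only has to cover zlim rather than the full range of z-scores, a single extreme value no longer compresses the colours used for the rest of the matrix. Note that clamped rows no longer have exactly zero mean and unit variance.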


In order to match the output from the heatplot function, I would like to propose two solutions:

I - add new functionality to the source code -> heatmap.3

The code can be found here. Feel free to browse through the revisions to see the changes made to the heatmap.2 function. In summary, I introduced the following options:

  • z-score transformation is performed prior to the clustering: scale=c("row","column")
  • the extreme values can be reassigned within the scaled data: zlim=c(-3,3)
  • option to switch off dendrogram reordering: reorder=FALSE

An example:

library(gtools)
library(RColorBrewer)  # provides brewer.pal()
cols <- colorRampPalette(brewer.pal(10, "RdBu"))(256)

distCor <- function(x) as.dist(1-cor(t(x)))
hclustAvg <- function(x) hclust(x, method="average")

heatmap.3(data, trace="none", scale="row", zlim=c(-3,3), reorder=FALSE,
          distfun=distCor, hclustfun=hclustAvg, col=rev(cols), symbreak=FALSE) 


II - define a function that provides all the required arguments to heatmap.2

If you prefer to use the original heatmap.2, the zClust function (below) reproduces all the steps performed by heatplot. It returns a list containing the scaled data matrix and the row and column dendrograms. These can be used as input to the heatmap.2 function:

# depending on the analysis, the data can be centered and scaled by row or column. 
# default parameters correspond to the ones in the heatplot function. 
distCor <- function(x) as.dist(1-cor(x))
zClust <- function(x, scale="row", zlim=c(-3,3), method="average") {
    if (scale=="row") z <- t(scale(t(x)))
    if (scale=="col") z <- scale(x)
    z <- pmin(pmax(z, zlim[1]), zlim[2])
    hcl_row <- hclust(distCor(t(z)), method=method)
    hcl_col <- hclust(distCor(z), method=method)
    return(list(data=z, Rowv=as.dendrogram(hcl_row), Colv=as.dendrogram(hcl_col)))
}

z <- zClust(data)

library(RColorBrewer)  # provides brewer.pal()
cols <- colorRampPalette(brewer.pal(10, "RdBu"))(256)

heatmap.2(z$data, trace='none', col=rev(cols), Rowv=z$Rowv, Colv=z$Colv)
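The approach relies on the plotting function using a supplied dendrogram as-is, without recomputing distances or reordering leaves. The sketch below checks that behaviour with base R's heatmap(), used here only as a stand-in so that the example runs without gplots; heatmap.2 is assumed to treat dendrogram arguments the same way. The simulated matrix and variable names are illustrative.

```r
set.seed(7)
x <- matrix(rnorm(15 * 8), nrow = 15,
            dimnames = list(paste0("g", 1:15), paste0("s", 1:8)))

# Same steps as zClust with scale = "row": z-score, clamp, then cluster.
z <- t(scale(t(x)))
z <- pmin(pmax(z, -3), 3)
row_dend <- as.dendrogram(hclust(as.dist(1 - cor(t(z))), method = "average"))
col_dend <- as.dendrogram(hclust(as.dist(1 - cor(z)),    method = "average"))

pdf(NULL)   # draw to a null device so no file or window is needed
res <- heatmap(z, Rowv = row_dend, Colv = col_dend, scale = "none")
invisible(dev.off())

# The plotted row/column order is exactly the order stored in the dendrograms.
identical(res$rowInd, order.dendrogram(row_dend))
identical(res$colInd, order.dendrogram(col_dend))
```

Passing scale = "none" matters here because the data has already been scaled; scaling again inside the plotting function would change the colours.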


A few additional comments regarding heatmap.2 and heatmap.3 functionality:

  • symbreak=TRUE is recommended when scaling is applied. It adjusts the colour scale so that it breaks symmetrically around 0. In the current example, negative values are blue and positive values are red. In heatmap.2 the full argument name is symbreaks; the shorter spelling works there through R's partial argument matching.
  • col=bluered(256) provides an alternative colour scheme and doesn't require the RColorBrewer library.
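What "breaking around 0" means can be sketched by building the colour breaks by hand in base R. This is an illustration under assumed values (256 colours, limits of -3 and 3), not the code heatmap.2 runs internally; the blue-white-red palette is a base-R approximation of bluered().

```r
n_col <- 256
zlim  <- c(-3, 3)

# Base-R palette running blue -> white -> red, similar in spirit to bluered().
cols <- colorRampPalette(c("blue", "white", "red"))(n_col)

# n_col colours need n_col + 1 break points; spacing them evenly over a
# range that is symmetric about 0 puts the white midpoint at zero.
breaks <- seq(zlim[1], zlim[2], length.out = n_col + 1)

# Which colour bin would a few example values fall into?
bin <- findInterval(c(-3, -0.5, 0.5, 3), breaks, all.inside = TRUE)
cols[bin]
```

A palette and break vector built this way could be supplied to heatmap.2 through its col and breaks arguments if explicit control over the colour mapping is needed.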
