Heatmap of categorical variable counts


Problem description

I have a data frame of items, each with several classifier columns that hold categorical variables.

ID    test1    test2     test3
1     A        B         A
2     B        A         C
3     C        C         C
4     A        A         B
5     B        B         B
6     B        A         C

I want to generate a heatmap for each pair of test columns (test1 vs. test2, test1 vs. test3, etc.) using ggplot2. Each heatmap would have the levels of one test (here A, B, C) on the x-axis and the levels of the other test on the y-axis, and each cell should be colored by the number of IDs that have that combination of levels.

For example, with the input above, in the heatmap of test1 against test2 the cell at the intersection of B for test1 and A for test2 would be the brightest, since 2 IDs have that combination. I hope to use these heatmaps to see which tests agree most closely on this data set; I can't use Pearson's correlation coefficient because the variables are categorical.

I am familiar with ggplot2, which is why I prefer that package, but if this is easier with pheatmap, I am happy to learn it.

Solution

It took me some time to work out how to do this, and I am still not sure it is the best way.

Data:

dat = structure(list(ID = 1:6, 
                     test1 = c("A", "B", "C", "A", "B", "B"), 
                     test2 = c("B", "A", "C", "A", "B", "A"), 
                     test3 = c("A", "C", "C", "B", "B", "C")
                     ), 
                .Names = c("ID", "test1", "test2", "test3"), 
                 class = "data.frame", row.names = c(NA, -6L)
                )

Libraries

library(tidyverse)
library(ggthemes)
library(gridExtra)

Create all pairs of factor levels, and all combinations of tests taken 2 at a time

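# All 3 x 3 ordered pairs of levels (columns Var1, Var2)
fcombs <- expand.grid(LETTERS[1:3], LETTERS[1:3], stringsAsFactors = FALSE)
# One column per pair of test names (V1, V2, V3)
tcombs <- as.data.frame(combn(colnames(dat[,-1]), 2), stringsAsFactors = FALSE)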
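To see what these two helper objects contain, here is a minimal base R sketch. The names levels_used, tests, level_pairs and test_pairs are introduced only for this illustration and are not part of the answer's code.

```r
levels_used <- LETTERS[1:3]
tests <- c("test1", "test2", "test3")

# expand.grid() returns every ordered pair of levels, one pair per row;
# unnamed arguments get the default column names Var1 and Var2
level_pairs <- expand.grid(levels_used, levels_used, stringsAsFactors = FALSE)

# combn() returns every unordered pair of test names, one pair per column
test_pairs <- combn(tests, 2)

nrow(level_pairs)   # 9 level pairs
ncol(test_pairs)    # 3 test pairs
test_pairs[, 2]     # "test1" "test3"
```

The level grid is ordered (A/B and B/A are separate rows) because each is a distinct heatmap cell, whereas each pair of tests is needed only once.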

Loop over the test combinations with lapply, full_join each pair of test columns against the grid of level pairs, and count the non-NA IDs in each group

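dtl <- lapply(tcombs, function(i) {
  # Keep ID plus the two test columns named in i
  select(dat, ID, all_of(i)) %>%
    # The piped data becomes y; the level grid is x, so empty combinations are kept
    full_join(x = fcombs, by = c("Var1" = i[1], "Var2" = i[2])) %>%
    group_by(Var1, Var2) %>%
    # One row per level pair; unmatched grid rows have NA IDs and count as 0
    summarise(N = sum(!is.na(ID)), .groups = "drop")
})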
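As a cross-check on the counts, the same table for one pair of tests can be built with base R's table(). This is a sketch of an alternative, not the method used in this answer; the names lv, tab and counts are introduced here for illustration.

```r
# Same sample data as in this answer (test3 omitted for brevity)
dat <- data.frame(
  ID = 1:6,
  test1 = c("A", "B", "C", "A", "B", "B"),
  test2 = c("B", "A", "C", "A", "B", "A"),
  stringsAsFactors = FALSE
)

# Fixing the levels keeps combinations that never occur (count 0)
lv <- c("A", "B", "C")
tab <- table(Var1 = factor(dat$test1, levels = lv),
             Var2 = factor(dat$test2, levels = lv))

# Long format: one row per (Var1, Var2) pair with its count in column N
counts <- as.data.frame(tab, responseName = "N", stringsAsFactors = FALSE)

# The B/A cell from the question's example
counts[counts$Var1 == "B" & counts$Var2 == "A", ]
```

The B/A row has N equal to 2, which matches the example in the question, and the nine counts sum to the six IDs.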

Create a list of plots

pl <- lapply(seq_along(tcombs), function(i){
        gtitle = paste(tcombs[[i]], collapse = " ~ ")
        dtl[[i]] %>%
        ggplot(aes(x = Var1, y = Var2, fill = N)) +
        geom_tile() +
        theme_tufte() +
        theme(axis.title = element_blank()) +
        ggtitle(gtitle)
        }
      )
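Each plot built this way trains its own fill scale, so in general the same color can stand for different counts in different panels. If the panels should be comparable, one option is to give every plot the same limits. The sketch below uses two small made-up count tables (count_list, with hypothetical values) in place of dtl; max_n and plots are likewise names introduced only here.

```r
library(ggplot2)

# Toy count tables standing in for two elements of dtl (hypothetical values)
count_list <- list(
  data.frame(Var1 = c("A", "B"), Var2 = c("A", "A"), N = c(1, 2)),
  data.frame(Var1 = c("A", "B"), Var2 = c("A", "A"), N = c(0, 3))
)

# Largest count across all tables, used as the common upper limit
max_n <- max(vapply(count_list, function(d) max(d$N), numeric(1)))

# Same fill limits in every plot, so colors are comparable across panels
plots <- lapply(count_list, function(d) {
  ggplot(d, aes(x = Var1, y = Var2, fill = N)) +
    geom_tile() +
    scale_fill_gradient(limits = c(0, max_n))
})
```

With the sample data in this answer every pair of tests happens to have the same range of counts (0 to 2), so the panels are already comparable there.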

Create list of tables (tableGrob objects)

tbl <- lapply(tcombs, function(i) tableGrob(select(dat, ID, all_of(i)),
                                            theme = ttheme_minimal()))

Put everything into the resulting list and plot

resl <- c(pl, tbl)[c(1, 4, 2, 5, 3, 6)]

grid.arrange(grobs = resl, ncol = 2, nrow = 3)
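The index c(1, 4, 2, 5, 3, 6) interleaves the three plots with the three tables so that each plot sits next to its table in the two-column layout. For a different number of test pairs, the same index can be computed rather than typed by hand. A minimal sketch, with n and interleaved as illustrative names:

```r
# Number of plots (and of tables); 3 in this answer
n <- 3

# rbind() stacks the plot positions over the table positions;
# as.vector() then reads the matrix column by column, giving
# plot 1, table 1, plot 2, table 2, ...
interleaved <- as.vector(rbind(seq_len(n), n + seq_len(n)))
interleaved   # 1 4 2 5 3 6
```

With that, c(pl, tbl)[interleaved] together with nrow = n in grid.arrange would follow the same layout for any n.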
