3-layer stacked histogram from already summarized counts using ggplot2

Question

I would like some help coloring a ggplot2 histogram generated from summarized data in a data.frame.

The dataset I'm using is R's built-in USArrests dataset.

I'm trying to adapt the solution that was given to this question by arun.

The desired result is to make a histogram of "Crime" and color each bar according to the relative contribution of c("Assault", "Rape", "Murder").

Code:

attach(USArrests)

#Create vector SUM arrests per state
Crime <- with(USArrests, Murder+ Rape+ Assault)

#bind Vector Crime to dataframe USArrets and name it USArrests.transform
USArrests.transform <- cbind (USArrests, Crime)

#See if package is installed, and do if not
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}

ggplot (data = USArrests.transform, aes(x= Crime)) + geom_histogram()
# get crime histogram plot and name it crime.plot
crime.plot <- ggplot (data = USArrests.transform, aes(x= Crime)) + geom_histogram()
# get data of crime plot: cols = count, xmin and xmax
crime.data <- ggplot_build(crime.plot)$data[[1]][c("count", "x", "xmin", "xmax")]
# add a id colum for ddply
crime.data$id <- seq(nrow(crime.data))

#See if package is installed, and do if not
if (!require("plyr")) {
  install.packages("plyr")
  library(plyr)
}

#Split data frame, apply function en return results in a data frame: ddply
crime.data.transform <- ddply(crime.data, .(id), function(x) {
  tranche <- USArrests.transform[USArrests.transform$Crime >= x$xmin & USArrests.transform$Crime <= x$xmax, ]
  if(nrow(tranche) == 0) return(c(x$x, 0, 0))
  crime.plot <- c(x=x$x, colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["Crime"] * x$count)
})

#See if package is installed, and do if not
if (!require("reshape2")) {
  install.packages("reshape2")
  library(reshape2)
}

crime.data.transform <- melt(crime.data.transform, id.var="id")
ggplot(data = crime.data.transform, aes(x=id, y=value)) + geom_bar(aes(fill=variable), stat="identity", group=1)

[Error]: The above gives the following error:

Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor) : 
  Results do not have equal lengths

Subsequently there are errors in the part after the reshape.

Any suggestions on what I'm doing wrong and how it could be solved in the above example?

Recommended answer

Sorry for the long answer; I felt like doing some code optimisation. Most of the code is not yours, but even in arun's code I found some room for optimisation. Let's go through what I changed:

  1. I removed your attach statement, because it was not needed. If you work with multiple datasets it is bad practice to use attach, mainly because you lose track of your data structures.
  2. If you create a sequence and the step is 1, just use : and not seq. I explained why here.
  3. The error in your code: in return(c(x$x, 0, 0)) there is one zero too few. That branch returns three values while the other branch returns four, which is why ddply complains that the results do not have equal lengths.
  4. In addition, you do not need x$x inside the ddply function. Thus it should just be return(c(0,0,0)), and the next line needs to start with c(colSums(tranche)[c("Murder", "Assault", "Rape")] without the x=x$x element. Otherwise R will plot all the x values as well.
  5. Heck! You actually do not need plyr here. This ddply function is just a simple loop over the rows of your crime.data data.frame. That is something you can achieve using an lapply loop.
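Points 3 and 4 can be checked without ggplot2 or plyr. The following is only a minimal base R sketch: the `bin` data frame and the numeric shares are made-up stand-ins for one row of crime.data, not values from USArrests.

```r
# Hypothetical stand-in for one histogram bin (made-up values)
bin <- data.frame(x = 50, xmin = 40, xmax = 60, count = 2)

# The two return branches of the original function
empty_branch  <- c(bin$x, 0, 0)
filled_branch <- c(x = bin$x, Murder = 0.1, Assault = 1.7, Rape = 0.2)

# Three elements versus four: the branches cannot be combined into one data frame
lengths_before <- c(length(empty_branch), length(filled_branch))

# After the fix both branches carry the same three named elements
empty_fixed  <- c(Murder = 0, Assault = 0, Rape = 0)
filled_fixed <- c(Murder = 0.1, Assault = 1.7, Rape = 0.2)
fixed <- rbind(empty_fixed, filled_fixed)
```

With equal lengths and matching names, the per-bin results bind cleanly into a two-row, three-column matrix.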

Here I should explain a bit: the plyr package tried to overcome the shortcomings of the apply family of functions. Except for lapply, their behaviour is rather unpredictable. sapply in particular might return anything from a vector to a matrix to a list. Only lapply is reliable - it always gives you a list result:

USArrests_sum <- cbind (USArrests, arrests=with(USArrests, Murder+ Rape+ Assault))

#See if package is installed, and do if not
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}

# Build the histogram and extract the bin data: count, x, xmin and xmax
crime.plot <- ggplot (data = USArrests_sum, aes(x= arrests)) + geom_histogram()
crime_df <- ggplot_build(crime.plot)$data[[1]][c("count", "x", "xmin", "xmax")]
crime_df$id <- 1:nrow(crime_df) # add an id column, one per bin

# Loop over the bins; split each bin's count by each crime's share of the arrests in that bin
tranche_list<-lapply(1:nrow(crime_df), function(j) {
  myrows<-(USArrests_sum$arrests >= crime_df$xmin[j] & USArrests_sum$arrests <= crime_df$xmax[j])
  tranche <- USArrests_sum[myrows,]
  if(nrow(tranche) == 0) return(c('Murder'=0,'Assault'=0,'Rape'=0))
  c(colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["arrests"] * crime_df$count[j])
})
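
To illustrate the earlier remark about return types, here is a small base R sketch. The inputs are toy values chosen only to trigger the different shapes; they are not part of the original answer.

```r
# sapply simplifies whenever it can, so the result type depends on the function's output
as_vector <- sapply(1:3, function(i) i^2)         # one value each: numeric vector
as_matrix <- sapply(1:3, function(i) c(i, i^2))   # two values each: 2 x 3 matrix
as_list   <- sapply(1:3, function(i) seq_len(i))  # differing lengths: list

# lapply never simplifies, so the result is a list in every case
always_list <- lapply(1:3, function(i) c(i, i^2))

# vapply is the strict variant: the expected shape is declared up front
checked <- vapply(1:3, function(i) i^2, numeric(1))
```

If a simplified result is wanted, vapply is usually the safer choice, because it raises an error when a return value does not match the declared template.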

The alternative is to use dplyr to transform your data; maybe somebody else feels like writing that. I prefer base R.
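
For readers who do want the dplyr route, a rough sketch could look like the following. It is not the answerer's code. It assumes fixed bin edges (`breaks`, a hypothetical choice) instead of the bins that geom_histogram() computes, and it uses cut(), whose intervals are closed on the right, so the bar heights will not match the plot above exactly.

```r
library(dplyr)

# Assumption: fixed bin edges chosen for illustration only
breaks <- seq(0, 400, by = 50)

binned <- USArrests %>%
  mutate(arrests = Murder + Assault + Rape,
         bin = cut(arrests, breaks = breaks, include.lowest = TRUE)) %>%
  group_by(bin) %>%
  summarise(n = n(),
            total = sum(arrests),
            Murder = sum(Murder),
            Assault = sum(Assault),
            Rape = sum(Rape)) %>%
  # scale each crime's share of the bin's arrests to the number of states in the bin
  mutate(across(c(Murder, Assault, Rape), ~ .x / total * n))
```

Each row of `binned` then holds one bin, and the three crime columns add up to the number of states in that bin, which is the stacked height of that bar.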

In the next step you use reshape2, whose successor is tidyr. But the data structure is actually so simple that you can use base R if you like:

stack_df2<-data.frame(value=as.numeric(unlist(tranche_list)),
                      variable=names(unlist(tranche_list)),
                      id=rep(1:nrow(crime_df),each=3))

ggplot(data = stack_df2, aes(x=id, y=value)) + geom_bar(aes(fill=variable), stat="identity", group=1)
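
The reshaping step relies on unlist() keeping the inner element names. A toy list (made-up values standing in for tranche_list) shows the mechanics:

```r
# Two hypothetical bins, each a named vector like the ones the lapply loop returns
toy_list <- list(c(Murder = 1, Assault = 2, Rape = 3),
                 c(Murder = 0, Assault = 0, Rape = 0))

flat <- unlist(toy_list)  # one long named vector, bin after bin

toy_df <- data.frame(value = as.numeric(flat),
                     variable = names(flat),
                     id = rep(seq_along(toy_list), each = 3))
```

This is also why the empty-bin branch returns named zeros: the names are what end up in the variable column, and each = 3 works only because every bin contributes exactly three values.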

Appendix

I compared several implementations with the ddply solution:

plyr_fun<-function(){
  ddply(crime_df, .(id), function(x) {
    tranche <- USArrests_sum[USArrests_sum$arrests >= x$xmin & USArrests_sum$arrests <= x$xmax, ]
    if(nrow(tranche) == 0) return(c(0, 0,0))
    crime.plot <- c(colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["arrests"] * x$count)
  })
}

apply_fun2<-function(){
  res_mat<-t(apply(crime_df, 1, function(x) {
    tranche <- USArrests_sum[USArrests_sum$arrests >= x['xmin'] & USArrests_sum$arrests <= x['xmax'], ]
    if(nrow(tranche) == 0) return(c(0, 0,0))
    crime.plot <- c(colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["arrests"] * x['count'])
  }))
  colnames(res_mat)=c("Murder", "Assault", "Rape")
}

lapply_fun3<-function(){
  tranche_list<-lapply(1:nrow(crime_df), function(j) {
    myrows<-(USArrests_sum$arrests >= crime_df$xmin[j] & USArrests_sum$arrests <= crime_df$xmax[j])
    tranche <- USArrests_sum[myrows,]
    if(nrow(tranche) == 0) return(c(0, 0,0))
    crime.plot <- c(colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["arrests"] * crime_df$count[j])
  })
  do.call(rbind,tranche_list)
}

lapply_fun<-function(){
  tranche_list<-lapply(1:nrow(crime_df), function(j) {
    myrows<-(USArrests_sum$arrests >= crime_df$xmin[j] & USArrests_sum$arrests <= crime_df$xmax[j])
    tranche <- USArrests_sum[myrows,]
    if(nrow(tranche) == 0) return(c('Murder'=0,'Assault'=0,'Rape'=0))
    crime.plot <- c(colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["arrests"] * crime_df$count[j])
  })
}

microbenchmark::microbenchmark(apply_fun2(),lapply_fun3(),lapply_fun(),plyr_fun(),times=1000L)
Unit: milliseconds
          expr    min      lq      mean   median       uq      max neval
  apply_fun2() 5.2307 5.73340  7.169920  6.17165  7.27340  31.5333  1000
 lapply_fun3() 5.3633 5.98930  7.487173  6.40780  7.50115  37.1350  1000
  lapply_fun() 5.4470 5.99295  7.762575  6.43975  7.73060  82.2069  1000
    plyr_fun() 8.8593 9.83850 12.186933 10.54180 12.75880 192.6898  1000

Actually the apply version is even slightly faster than the lapply solutions (a median of about 6.2 ms versus 6.4 ms above), but its readability is quite bad. Usually data.table functions are faster than the apply family, whereas dplyr functions run comparatively slowly but are readable and well suited to code translation.

Just for fun - another benchmark of tidyr vs my base R solution:

tidyr_fun<-function(){
  crime_tranche<-do.call(rbind,tranche_list)
  stack_df <- gather(data.frame(crime_tranche,id=1:nrow(crime_df)), key=variable,value=value,-id)
}

base_fun<-function(){
  stack_df2<-data.frame(value=as.numeric(unlist(tranche_list)),
                        variable=names(unlist(tranche_list)),
                        id=rep(1:nrow(crime_df),each=3))
}

microbenchmark::microbenchmark(tidyr_fun(),base_fun())
Unit: microseconds
        expr    min      lq     mean  median     uq    max neval
 tidyr_fun() 1588.4 1869.45 2516.253 2302.35 2777.9 7671.3   100
  base_fun()  286.7  367.40  530.104  454.85  612.8 3675.8   100

# In case you want to verify that the data is the same.
# Both functions only assign locally, so capture their return values first:
stack_df <- tidyr_fun()
stack_df2 <- base_fun()
identical(stack_df2$id[order(stack_df2$id,stack_df2$variable)],stack_df$id[order(stack_df$id,stack_df$variable)])
identical(stack_df2$value[order(stack_df2$id,stack_df2$variable)],stack_df$value[order(stack_df$id,stack_df$variable)])
identical(as.character(stack_df2$variable[order(stack_df2$id,stack_df2$variable)]),stack_df$variable[order(stack_df$id,stack_df$variable)])
