3-layer stacked histogram from already summarized counts using ggplot2
Problem description
I would like some help coloring a ggplot2 histogram generated from summarized data in a data.frame.
The dataset I'm using is R's built-in USArrests dataset.
I'm trying to adapt the solution that arun gave to this question.
The desired result is a histogram of "Crime" in which each bar is colored according to the relative contribution of c("Assault", "Rape", "Murder").
Code:
attach(USArrests)
# Create a vector with the summed arrests per state
Crime <- with(USArrests, Murder + Rape + Assault)
# Bind the vector Crime to the data frame USArrests and name the result USArrests.transform
USArrests.transform <- cbind(USArrests, Crime)
# Install ggplot2 if it is missing, then load it
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}
ggplot(data = USArrests.transform, aes(x = Crime)) + geom_histogram()
# Store the crime histogram as crime.plot
crime.plot <- ggplot(data = USArrests.transform, aes(x = Crime)) + geom_histogram()
# Extract the data behind the plot: count, x, xmin and xmax
crime.data <- ggplot_build(crime.plot)$data[[1]][c("count", "x", "xmin", "xmax")]
# Add an id column for ddply
crime.data$id <- seq(nrow(crime.data))
# Install plyr if it is missing, then load it
if (!require("plyr")) {
  install.packages("plyr")
  library(plyr)
}
# Split the data frame, apply a function and return the results in a data frame: ddply
crime.data.transform <- ddply(crime.data, .(id), function(x) {
  tranche <- USArrests.transform[USArrests.transform$Crime >= x$xmin & USArrests.transform$Crime <= x$xmax, ]
  if (nrow(tranche) == 0) return(c(x$x, 0, 0))
  crime.plot <- c(x = x$x, colSums(tranche)[c("Murder", "Assault", "Rape")] / colSums(tranche)["Crime"] * x$count)
})
# Install reshape2 if it is missing, then load it
if (!require("reshape2")) {
  install.packages("reshape2")
  library(reshape2)
}
crime.data.transform <- melt(crime.data.transform, id.var = "id")
ggplot(data = crime.data.transform, aes(x = id, y = value)) + geom_bar(aes(fill = variable), stat = "identity", group = 1)
[Error]: The above gives the following error:
Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor) :
Results do not have equal lengths
Subsequently there are errors in the part after the reshape.
Any suggestions on what I'm doing wrong and how it could be solved in the above example?
Recommended answer
Sorry for the long answer; I felt like doing some code optimisation. Most of the code is not yours, but even in arun's code I found some room for optimisation. Let's go through what I changed:
- I removed your attach statement, because it was not needed. If you work with multiple datasets, using attach is bad practice, mainly because you lose track of your data structures.
- If you create a sequence and the step is 1, just use the : operator and not seq. I explained why here.
- The error in your code: return(c(x$x, 0, 0)) has one zero too few. The regular branch returns four values (x, Murder, Assault, Rape), while the empty-bin branch returns only three, so ddply cannot combine the results.
- In addition, you do not need x$x inside the ddply function. The empty-bin branch should just be return(c(0, 0, 0)), and the next line should start with c(colSums(tranche)[c("Murder", "Assault", "Rape")] without the leading x = x$x. Otherwise R will plot all the x values as well.
- Heck! You do not actually need plyr here. This ddply call is just a simple loop over the rows of your crime.data data frame. That is something you can achieve with an lapply loop.
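The length mismatch behind the error can be seen without plyr. The following is a minimal base R sketch, assuming made-up numbers for a single bin; the names nonempty_result, empty_result_bad and empty_result_ok are hypothetical and exist only for this illustration.

```r
# Hypothetical per-bin results (made-up numbers, not taken from USArrests)
nonempty_result  <- c(x = 150, Murder = 1, Assault = 8, Rape = 1)  # 4 elements
empty_result_bad <- c(150, 0, 0)                                   # only 3 elements
empty_result_ok  <- c(x = 150, Murder = 0, Assault = 0, Rape = 0)  # 4 elements

# Results of different lengths cannot be stacked into one rectangular data frame
lengths(list(nonempty_result, empty_result_bad))

# Once both branches return the same number of elements, stacking works
fixed <- do.call(rbind, list(nonempty_result, empty_result_ok))
dim(fixed)
```

Whichever shape you choose for the result, the point is that every branch of the function has to return the same number of elements.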
Regarding the apply family, I should perhaps explain a bit: the plyr package tried to overcome the shortcomings of the apply family of functions. Except for lapply, their behaviour is rather unpredictable, because the type of the return value depends on the results. sapply in particular might return anything from a vector to a matrix to a list. Only lapply is reliable: it always gives you a list as the result:
USArrests_sum <- cbind(USArrests, arrests = with(USArrests, Murder + Rape + Assault))
# Install ggplot2 if it is missing, then load it
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}
# Store the crime histogram as crime.plot
crime.plot <- ggplot(data = USArrests_sum, aes(x = arrests)) + geom_histogram()
# Extract the data behind the plot: count, x, xmin and xmax
crime_df <- ggplot_build(crime.plot)$data[[1]][c("count", "x", "xmin", "xmax")]
# Add an id column (one id per histogram bin)
crime_df$id <- 1:nrow(crime_df)
# Loop over the bins and split each bin's count by the share of each crime type
tranche_list <- lapply(1:nrow(crime_df), function(j) {
  myrows <- (USArrests_sum$arrests >= crime_df$xmin[j] & USArrests_sum$arrests <= crime_df$xmax[j])
  tranche <- USArrests_sum[myrows, ]
  # Empty bin: return named zeros so every result has the same length and names
  if (nrow(tranche) == 0) return(c('Murder' = 0, 'Assault' = 0, 'Rape' = 0))
  c(colSums(tranche)[c("Murder", "Assault", "Rape")] / colSums(tranche)["arrests"] * crime_df$count[j])
})
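To illustrate the remark about sapply, here is a small base R sketch with toy inputs (the variable names v, m, l and always_list are made up for this example). The same call pattern yields three different return types, while lapply consistently returns a list.

```r
# Every result has length 1: sapply simplifies to a plain numeric vector
v <- sapply(1:3, function(i) i^2)

# Every result has length 2: sapply simplifies to a 2 x 3 matrix
m <- sapply(1:3, function(i) c(i, i^2))

# Results differ in length: sapply cannot simplify and returns a list
l <- sapply(1:3, function(i) seq_len(i))

# lapply returns a list in all of these cases
always_list <- lapply(1:3, function(i) i^2)

class(v)
class(m)
class(l)
class(always_list)
```

Because the return type of lapply never changes, the code that consumes tranche_list does not need to handle special cases.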
The alternative is to use dplyr to transform your data; maybe somebody else feels like doing that. I prefer base R.
In the next step you use reshape2, whose successor is tidyr. But the data structure is so simple that you can use base R if you like:
stack_df2 <- data.frame(value = as.numeric(unlist(tranche_list)),
                        variable = names(unlist(tranche_list)),
                        id = rep(1:nrow(crime_df), each = 3))
ggplot(data = stack_df2, aes(x = id, y = value)) + geom_bar(aes(fill = variable), stat = "identity", group = 1)
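The base R reshaping step can be checked on a toy input. This is a minimal sketch assuming a hypothetical two-bin list toy_list with made-up values; it relies on unlist keeping the element names of the named vectors when the list itself is unnamed.

```r
# Hypothetical stand-in for tranche_list: one named vector per bin (made-up values)
toy_list <- list(
  c(Murder = 1, Assault = 2, Rape = 3),
  c(Murder = 0, Assault = 0, Rape = 0)
)

# Flatten once; the element names survive because the list itself has no names
flat <- unlist(toy_list)

# Long format: one row per (bin, crime type) combination
toy_long <- data.frame(
  value    = as.numeric(flat),
  variable = names(flat),
  id       = rep(seq_along(toy_list), each = 3)
)
toy_long
```

The each = 3 argument assumes exactly three values per bin, which is why the empty-bin branch has to return three named zeros as well.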
Appendix
I compared several alternatives with the ddply solution:
# plyr_fun() requires library(plyr)
plyr_fun <- function() {
  ddply(crime_df, .(id), function(x) {
    tranche <- USArrests_sum[USArrests_sum$arrests >= x$xmin & USArrests_sum$arrests <= x$xmax, ]
    if (nrow(tranche) == 0) return(c(0, 0, 0))
    c(colSums(tranche)[c("Murder", "Assault", "Rape")] / colSums(tranche)["arrests"] * x$count)
  })
}
apply_fun2 <- function() {
  res_mat <- t(apply(crime_df, 1, function(x) {
    tranche <- USArrests_sum[USArrests_sum$arrests >= x['xmin'] & USArrests_sum$arrests <= x['xmax'], ]
    if (nrow(tranche) == 0) return(c(0, 0, 0))
    c(colSums(tranche)[c("Murder", "Assault", "Rape")] / colSums(tranche)["arrests"] * x['count'])
  }))
  colnames(res_mat) <- c("Murder", "Assault", "Rape")
  res_mat  # return the matrix rather than the value of the colnames assignment
}
lapply_fun3 <- function() {
  tranche_list <- lapply(1:nrow(crime_df), function(j) {
    myrows <- (USArrests_sum$arrests >= crime_df$xmin[j] & USArrests_sum$arrests <= crime_df$xmax[j])
    tranche <- USArrests_sum[myrows, ]
    if (nrow(tranche) == 0) return(c(0, 0, 0))
    c(colSums(tranche)[c("Murder", "Assault", "Rape")] / colSums(tranche)["arrests"] * crime_df$count[j])
  })
  do.call(rbind, tranche_list)
}
lapply_fun <- function() {
  tranche_list <- lapply(1:nrow(crime_df), function(j) {
    myrows <- (USArrests_sum$arrests >= crime_df$xmin[j] & USArrests_sum$arrests <= crime_df$xmax[j])
    tranche <- USArrests_sum[myrows, ]
    if (nrow(tranche) == 0) return(c('Murder' = 0, 'Assault' = 0, 'Rape' = 0))
    c(colSums(tranche)[c("Murder", "Assault", "Rape")] / colSums(tranche)["arrests"] * crime_df$count[j])
  })
}
microbenchmark::microbenchmark(apply_fun2(), lapply_fun3(), lapply_fun(), plyr_fun(), times = 1000L)
Unit: milliseconds
expr              min       lq       mean    median        uq       max  neval
apply_fun2()   5.2307  5.73340   7.169920   6.17165   7.27340   31.5333   1000
lapply_fun3()  5.3633  5.98930   7.487173   6.40780   7.50115   37.1350   1000
lapply_fun()   5.4470  5.99295   7.762575   6.43975   7.73060   82.2069   1000
plyr_fun()     8.8593  9.83850  12.186933  10.54180  12.75880  192.6898   1000
In fact, the apply solution is even slightly faster than the lapply solutions, but its readability is quite poor. Usually data.table functions are faster than the apply family, whereas dplyr functions run comparatively slowly but are readable and suitable for code translation.
Just for fun, another benchmark of tidyr versus my base R solution:
# tidyr_fun() requires library(tidyr) for gather()
tidyr_fun <- function() {
  crime_tranche <- do.call(rbind, tranche_list)
  stack_df <- gather(data.frame(crime_tranche, id = 1:nrow(crime_df)), key = variable, value = value, -id)
}
base_fun <- function() {
  stack_df2 <- data.frame(value = as.numeric(unlist(tranche_list)),
                          variable = names(unlist(tranche_list)),
                          id = rep(1:nrow(crime_df), each = 3))
}
microbenchmark::microbenchmark(tidyr_fun(), base_fun())
Unit: microseconds
expr            min       lq      mean   median      uq     max  neval
tidyr_fun()  1588.4  1869.45  2516.253  2302.35  2777.9  7671.3    100
base_fun()    286.7   367.40   530.104   454.85   612.8  3675.8    100
# In case you want to verify that the data are the same:
stack_df <- tidyr_fun()
stack_df2 <- base_fun()
identical(stack_df2$id[order(stack_df2$id, stack_df2$variable)], stack_df$id[order(stack_df$id, stack_df$variable)])
identical(stack_df2$value[order(stack_df2$id, stack_df2$variable)], stack_df$value[order(stack_df$id, stack_df$variable)])
identical(as.character(stack_df2$variable[order(stack_df2$id, stack_df2$variable)]), stack_df$variable[order(stack_df$id, stack_df$variable)])