Adding sample size to a box plot at the min or max of the facet in ggplot


Question


There are plenty of explanations, including this good one, of how to label box plots with sample size. All of them seem to use max(x) or median(x) to position the sample size.

I'm wondering whether there is an easy way to position the labels at the top or bottom of the plot, especially when using the scale = "free_y" argument to facet_grid, where ggplot automatically picks the minimum and maximum of the y axis for each facet.

The reason is that I am creating multiple facets where the distributions are narrow and the facets are small. It would be easier to read the sample size if it were positioned at the top or bottom of the plot...but I'd like to use "free_y" because there are meaningful differences in some facets that are obscured by the facets that have much larger spans in the data.

Using a slightly modified example from the linked post:

library(ggplot2)

# function for number of observations
give.n <- function(x){
  return(c(y = median(x)*1.05, label = length(x))) 
  # experiment with the multiplier to find the perfect position
}

# function for mean labels
mean.n <- function(x){
  return(c(y = median(x)*0.97, label = round(mean(x),2))) 
  # experiment with the multiplier to find the perfect position
}

# plot
ggplot(mtcars, aes(factor(cyl), mpg, label=rownames(mtcars))) +
  geom_boxplot(fill = "grey80", colour = "#3366FF") +
  stat_summary(fun.data = give.n, geom = "text", fun.y = median) +
  stat_summary(fun.data = mean.n, geom = "text", fun.y = mean, colour = "red") +
  facet_grid(cyl~., scale="free_y")

Given this setup, how could I find the min or max of the y axis for each facet and position the sample size there, instead of at the median, min or max of each box-and-whisker?

EDIT

I'm updating the question with information from R.S.'s answer below. It's still not answered yet, but their suggestion provides a solution for where to find this information.

ggplot_build(gg)$layout$panel_ranges[[order(levels(factor(mtcars$cyl)))[1]]]$y.range[1]

gives the minimum of the y range for the first level of factor(mtcars$cyl). So, by my logic, we need to build the plot without the stat_summary statements, then find the sample size and the minimum of the y range inside the give.n function. After that, we can add the stat_summary statement to the plot, like below:

# plot
gg = ggplot(mtcars, aes(factor(cyl), mpg, label=rownames(mtcars))) +
  geom_boxplot(fill = "grey80", colour = "#3366FF") +
  facet_grid(cyl~., scale="free_y")

# function for number of observations 
give.n <- function(x){
  return(c(y = ggplot_build(gg)$layout$panel_ranges[[order(levels(factor(mtcars$cyl)))[x]]]$y.range[1], label = length(x))) 
  # experiment with the multiplier to find the perfect position
}

gg +
  stat_summary(fun.data = give.n, geom = "text", fun.y = "median")

But...the above code doesn't work because I don't really understand what the give.n function is iterating over. Replacing [[x]] with any of 1:3 plots all the sample sizes at the minimum for that facet, so that is progress.

Here is the plot using [[2]], so all sample sizes are plotted at 17.62, the minimum value of the range for the second facet.
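On the question of what give.n iterates over: stat_summary calls the fun.data function once per group and passes it that group's y values, so x is a numeric vector of mpg values rather than a panel index. That is why [[x]] cannot select a panel. The base R sketch below imitates those per-group calls outside ggplot2; show.input is a hypothetical helper introduced only for this illustration.

```r
# Hypothetical helper: report what a fun.data-style function is handed.
show.input <- function(x) {
  c(n = length(x), min = min(x), max = max(x))
}

# One call per cyl level, each receiving that level's mpg values,
# which is roughly how stat_summary feeds fun.data in this plot.
per.group <- sapply(split(mtcars$mpg, mtcars$cyl), show.input)
per.group
```

Each column of per.group corresponds to one cyl level, and the n row holds the sample sizes.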

Solution

You can examine the structure of the built plot using ggplot_build; in particular, the x and y panel ranges are stored in layout. Assign your plot to an object and look at the structure:

gg <- ggplot(mtcars, aes(factor(cyl), mpg, label=rownames(mtcars))) +
  geom_boxplot(fill = "grey80", colour = "#3366FF") +
  stat_summary(fun.data = give.n, geom = "text", fun.y = median) +
  stat_summary(fun.data = mean.n, geom = "text", fun.y = mean, colour = "red") +
  facet_grid(cyl~., scale="free_y")

  ggplot_build(gg)

In particular you will be interested in:

  ggplot_build(gg)$layout$panel_ranges

The y limits of the three panels are each given as c(ymin, ymax) and stored under:

 ggplot_build(gg)$layout$panel_ranges[[1]]$y.range
 ggplot_build(gg)$layout$panel_ranges[[2]]$y.range
 ggplot_build(gg)$layout$panel_ranges[[3]]$y.range
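A note on versions: panel_ranges is the field name in the ggplot2 release this answer was written against. In later releases (3.0.0 onward, as far as I know) the same per-panel information lives under panel_params. The sketch below reads whichever field is present; the object names p, built, panels and y.ranges are made up for this illustration, and the behaviour in the very newest ggplot2 releases is an assumption.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot() +
  facet_grid(cyl ~ ., scales = "free_y")

# Build once and reuse the result instead of rebuilding per panel.
built <- ggplot_build(p)

# Newer releases: panel_params; older releases: panel_ranges.
panels <- built$layout$panel_params
if (is.null(panels)) panels <- built$layout$panel_ranges

# One row per panel, columns are the lower and upper y limits.
y.ranges <- t(sapply(panels, function(panel) panel$y.range))
y.ranges
```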

Edited in response to a comment asking how to incorporate this layout information into the plot. Instead of using stat_summary, we calculate the summary statistics grouped by cyl separately with dplyr and store them in a separate data frame to pass to ggplot2.

 library(dplyr)
 gg.summary <- group_by(mtcars, cyl) %>% summarise(mean=mean(mpg), median=median(mpg), length=length(mpg))

Extract the lower y limit of each panel and add it to the summary data frame. The summary data frame is grouped by cyl, which is the variable we are faceting on:

 # build the plot once, then take the lower y limit of each panel in facet order
 built <- ggplot_build(gg)
 gg.summary$panel.ylim <- sapply(seq_along(levels(factor(mtcars$cyl))), function(i) built$layout$panel_ranges[[i]]$y.range[1])
 # # A tibble: 3 x 5
 # cyl     mean median length panel.ylim
 # <dbl>    <dbl>  <dbl>  <int>      <dbl>
 # 1     4 26.66364   26.0     11     20.775
 # 2     6 19.74286   19.7      7     17.620
 # 3     8 15.10000   15.2     14      9.960

Add these values to the plot with geom_text. I believe this is the plot you want:

 gg + geom_text(data=gg.summary, aes(x=factor(cyl), y=panel.ylim, label=paste("n =",length))) +
   geom_text(data=gg.summary, aes(x=factor(cyl), y=median*0.97, label=format(median, nsmall=2)))
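As an alternative to reading the panel limits out of the built plot, ggplot2 places infinite positions at the panel edge, so y = -Inf (or Inf) pins a label to the bottom (or top) of each facet even with free scales. This is a different technique from the one in the answer above, shown here only as a minimal sketch; n.labels and p are names invented for the example.

```r
library(ggplot2)

# One row per cyl level with its sample size (base R only).
n.labels <- aggregate(mpg ~ cyl, data = mtcars, FUN = length)
names(n.labels)[2] <- "n"

p <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot(fill = "grey80", colour = "#3366FF") +
  # y = -Inf anchors the text at the bottom edge of each panel;
  # the negative vjust lifts it just inside the panel.
  geom_text(data = n.labels,
            aes(x = factor(cyl), y = -Inf, label = paste("n =", n)),
            vjust = -0.5) +
  facet_grid(cyl ~ ., scales = "free_y")
p
```

For labels at the top instead, y = Inf with a vjust of about 1.5 should work the same way.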
