Override lower, upper, etc. in boxplot while grouping


Question

By default, geom_boxplot uses the 25%, 50%, and 75% quantiles for the lower, middle, and upper box values. These are computed from y, but they can be set manually via the aesthetics lower, upper, and middle (together with x, ymin, and ymax, and with stat = "identity").

However, doing so has several undesirable effects (cf. version 1 in the example code):

  • The argument group is ignored, so all values of a column enter each calculation (for instance, the lower quantile is computed over the whole column rather than per group)
  • The resulting identical boxplots are grouped by x and repeated within each group as often as that group value occurs in the data (instead of being merged into one wider box)
  • Outliers are not plotted
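
The first point can be checked without any plotting. The following is a minimal base R sketch, assuming the same data and qlower as in the example code further down; pooled_lower and grouped_lower are illustrative names, not part of the original post.

```r
set.seed(12)
u <- data.frame(A = sample.int(4, 100, replace = TRUE), B = rnorm(100))
qlower <- 0.2

# An expression such as quantile(B, qlower) sees the whole column,
# so it yields a single value shared by every group
pooled_lower <- quantile(u$B, qlower, names = FALSE)

# What was intended: one value per level of A
grouped_lower <- tapply(u$B, u$A, quantile, probs = qlower, names = FALSE)

print(pooled_lower)
print(grouped_lower)
```

The pooled result is a single number, while the grouped result has one entry per value of A, which is the difference between version 1 and version 2 below.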

By pre-computing the desired values and storing them in a new data frame, one can handle the first two points (cf. version 2 in the example code), while the third point is fixed by identifying the outliers and adding them separately to the chart via geom_point.

Is there a more straightforward way to change the quantiles without these undesired effects?

Example Code:

library(ggplot2)

set.seed(12)

# Random data in B, grouped by values 1 to 4 in A
u <- data.frame(A = sample.int(4, 100, replace = TRUE), B = rnorm(100))

# Desired arguments
qymax <- 0.9
qymin <- 0.1
qmiddle <- 0.5
qupper <- 0.8
qlower <- 0.2

Version 1: Repeated boxplots per value in A, grouped by A

ggplot(u, aes(x = A, y = B)) + 
  geom_boxplot(aes(group=A, 
                   lower = quantile(B, qlower), 
                   upper = quantile(B, qupper), 
                   middle = quantile(B, qmiddle), 
                   ymin = quantile(B, qymin), 
                   ymax = quantile(B, qymax) ), 
               stat="identity")

Version 2: Compute the arguments for each group first (base R solution)

Bgrouped <- lapply(unique(u$A), function(a) u$B[u$A == a])
.lower <- sapply(Bgrouped, function(x) quantile(x, qlower))
.upper <- sapply(Bgrouped, function(x) quantile(x, qupper))
.middle <- sapply(Bgrouped, function(x) quantile(x, qmiddle))
.ymin <- sapply(Bgrouped, function(x) quantile(x, qymin))
.ymax <- sapply(Bgrouped, function(x) quantile(x, qymax))

u <- data.frame(A = unique(u$A), 
                lower = .lower, 
                upper = .upper, 
                middle = .middle, 
                ymin = .ymin, 
                ymax = .ymax)    

ggplot(u, aes(x = A)) + 
  geom_boxplot(aes(lower = lower, upper = upper, 
                   middle = middle, ymin = ymin, ymax = ymax ), 
               stat="identity")

Solution

It's not something I'd really do without a lot of justification, as people typically expect the boxplot's min / max / box values to correspond to the same quantile positions, but it can be done.

Data used (with extreme values added to demonstrate outliers):

set.seed(12)
u <- data.frame(A = sample.int(4, 100, replace = TRUE), B = rnorm(100))
u$B[c(30, 70, 76)] <- c(4, -4, -5)

Solution 1: You can pre-compute the values without going down the base R route, and include the outlier calculation in the same step. I'd do it entirely within the tidyverse packages, which I find neater:

library(dplyr)
library(tidyr)
library(ggplot2)

u %>%
  group_by(A) %>%
  summarise(lower = quantile(B, qlower),
            upper = quantile(B, qupper), 
            middle = quantile(B, qmiddle), 
            IQR = diff(c(lower, upper)),
            ymin = max(quantile(B, qymin), lower - 1.5 * IQR), 
            ymax = min(quantile(B, qymax), upper + 1.5 * IQR),
            outliers = list(B[which(B > upper + 1.5 * IQR | 
                                      B < lower - 1.5 * IQR)])) %>%
  ungroup() %>% 
  ggplot(aes(x = A)) + 
  geom_boxplot(aes(lower = lower, upper = upper,
                   middle = middle, ymin = ymin, ymax = ymax ),
               stat="identity") + 
  geom_point(data = . %>% 
               # keep only the groups that have at least one outlier
               filter(sapply(outliers, length) > 0) %>%
               select(A, outliers) %>%
               # one row per outlier value
               unnest(outliers), 
             aes(y = outliers))
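
To check the numbers for a single group by hand, the summarise() step can be mirrored in base R. This is only a sketch of the same rule: box_stats and its argument names are hypothetical, and the default probs are the quantile positions from the question.

```r
# Hypothetical helper mirroring the summarise() step for one group
box_stats <- function(y, probs = c(0.1, 0.2, 0.5, 0.8, 0.9), coef = 1.5) {
  q <- quantile(y, probs, names = FALSE)
  spread <- q[4] - q[2]                  # distance between upper and lower
  lo_fence <- q[2] - coef * spread
  hi_fence <- q[4] + coef * spread
  list(
    ymin = max(q[1], lo_fence),          # whisker ends, capped at the fences
    lower = q[2],
    middle = q[3],
    upper = q[4],
    ymax = min(q[5], hi_fence),
    outliers = y[y > hi_fence | y < lo_fence]
  )
}

s <- box_stats(c(1:11, 100))
str(s)
```

Under this rule, a value that lies between a whisker end and the corresponding fence is drawn neither as part of the whisker nor as an outlier point, which is one reason to be careful when interpreting such a plot.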

Solution 2: You can override the actual quantile specification used by ggplot2. The quantiles for geom_boxplot() are computed in StatBoxplot's compute_group() function, which begins as follows:

compute_group = function(data, scales, width = NULL, na.rm = FALSE, coef = 1.5) {
    qs <- c(0, 0.25, 0.5, 0.75, 1)

    if (!is.null(data$weight)) {
      mod <- quantreg::rq(y ~ 1, weights = weight, data = data, tau = qs)
      stats <- as.numeric(stats::coef(mod))
    } else {
      stats <- as.numeric(stats::quantile(data$y, qs))
    }

... (omitted for space)

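The qs vector defines the quantile positions. It is not affected by any parameter passed to compute_group(), so the only way to change it is to replace the definition of compute_group() itself. The body below matches the ggplot2 version this answer was written against; if your installed version differs, copy its definition instead: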

# save a copy of the original function, in case you need to revert
original.function <- environment(ggplot2::StatBoxplot$compute_group)$f

# define new function (only the first line for qs is changed, but you'll have to
# copy & paste the whole thing)
new.function <- function (data, scales, width = NULL, na.rm = FALSE, coef = 1.5) {
  qs <- c(0.1, 0.2, 0.5, 0.8, 0.9)
  if (!is.null(data$weight)) {
    mod <- quantreg::rq(y ~ 1, weights = weight, data = data, 
                        tau = qs)
    stats <- as.numeric(stats::coef(mod))
  }
  else {
    stats <- as.numeric(stats::quantile(data$y, qs))
  }
  names(stats) <- c("ymin", "lower", "middle", "upper", "ymax")
  iqr <- diff(stats[c(2, 4)])
  outliers <- data$y < (stats[2] - coef * iqr) | data$y > (stats[4] + 
                                                             coef * iqr)
  if (any(outliers)) {
    stats[c(1, 5)] <- range(c(stats[2:4], data$y[!outliers]), 
                            na.rm = TRUE)
  }
  if (length(unique(data$x)) > 1) 
    width <- diff(range(data$x)) * 0.9
  df <- as.data.frame(as.list(stats))
  df$outliers <- list(data$y[outliers])
  if (is.null(data$weight)) {
    n <- sum(!is.na(data$y))
  }
  else {
    n <- sum(data$weight[!is.na(data$y) & !is.na(data$weight)])
  }
  df$notchupper <- df$middle + 1.58 * iqr/sqrt(n)
  df$notchlower <- df$middle - 1.58 * iqr/sqrt(n)
  df$x <- if (is.factor(data$x)) 
    data$x[1]
  else mean(range(data$x))
  df$width <- width
  df$relvarwidth <- sqrt(n)
  df
}

Result:

# toggle between the two definitions
environment(StatBoxplot$compute_group)$f <- original.function
ggplot(u, aes(x = A, y = B, group = A)) +
  geom_boxplot() +
  ggtitle("original definition for calculated quantiles")

environment(StatBoxplot$compute_group)$f <- new.function
ggplot(u, aes(x = A, y = B, group = A)) +
  geom_boxplot() +
  ggtitle("new definition for calculated quantiles")
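
If replacing ggplot2 internals feels too invasive, a lighter variation is to let stat_summary() compute the five values per group and draw them with the boxplot geom. This is a different technique from the override above and not part of the original answer; custom_box is a hypothetical helper, and outliers are not drawn in this sketch.

```r
library(ggplot2)

set.seed(12)
u <- data.frame(A = sample.int(4, 100, replace = TRUE), B = rnorm(100))

# Hypothetical helper: the five values the boxplot geom needs for one group
custom_box <- function(y) {
  q <- stats::quantile(y, c(0.1, 0.2, 0.5, 0.8, 0.9), names = FALSE)
  data.frame(ymin = q[1], lower = q[2], middle = q[3],
             upper = q[4], ymax = q[5])
}

# stat_summary() applies custom_box to the y values at each x position
p <- ggplot(u, aes(x = factor(A), y = B)) +
  stat_summary(fun.data = custom_box, geom = "boxplot")
print(p)
```

Because the quantile positions live in an ordinary function, other plots in the session keep the default definition.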

Do note that changing the definition affects every ggplot object in your environment. So if you created a ggplot boxplot object before the definition change and print it afterwards, the boxplot will follow the new definition. (For the side-by-side comparison above, I had to convert each ggplot to a grob object immediately, in order to preserve the difference.)
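
The grob conversion mentioned above can be sketched as follows, assuming ggplotGrob() is used for it (the answer does not say which function was used). The statistics are computed when the grob is built, so drawing it later does not pick up a changed definition.

```r
library(ggplot2)

set.seed(12)
u <- data.frame(A = sample.int(4, 100, replace = TRUE), B = rnorm(100))

p <- ggplot(u, aes(x = A, y = B, group = A)) +
  geom_boxplot()

# Build the plot now: the boxplot statistics are computed with whatever
# compute_group definition is active at this moment
g <- ggplotGrob(p)

# Later: draw the stored grob; the statistics are not recomputed
grid::grid.newpage()
grid::grid.draw(g)
```

Printing p itself at a later point would rebuild the plot and therefore use the definition active at that time.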
