Override lower, upper, etc. in boxplot while grouping

Published: 2020/9/23 2:24:41. Tags: r, ggplot2, boxplot


Problem description


By default, geom_boxplot uses the 25%, 50% and 75% quantiles for the lower, middle and upper values. These are computed from y, but can be set manually via the aesthetic arguments lower, upper and middle (also providing x, ymin and ymax, and setting stat="identity").

However, doing so has several undesirable effects (cf. version 1 in the example code):

The argument group is ignored, so all values of a column are considered in the calculations (for instance when computing the lowest quantile for each group).

The resulting identical boxplots are grouped by x, and repeated within the group as often as the specific group value occurs in the data (instead of merging the boxes into a wider one).

Outliers are not plotted.
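The first point can be checked without plotting. The following is a minimal base R sketch, assuming the same data and qlower as in the example code: an expression such as quantile(B, qlower) inside aes() is evaluated on the whole column, whereas the intended value is one quantile per level of A. The names whole_column and per_group are illustrative only.

```r
set.seed(12)
u <- data.frame(A = sample.int(4, 100, replace = TRUE), B = rnorm(100))
qlower <- 0.2

# What aes(lower = quantile(B, qlower)) works with when stat = "identity":
# a single value computed over the entire column, recycled to every row.
whole_column <- quantile(u$B, qlower)

# What was intended: one value per level of A.
per_group <- tapply(u$B, u$A, quantile, probs = qlower)

whole_column
per_group
```

Because every row receives the same five values, each row is drawn as its own (identical) box at its x position, which matches the second point as well.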

By pre-computing the desired values and storing them in a new data frame, one can handle the first two points (cf. version 2 in the example code), while the third point is fixed by identifying the outliers and adding them separately to the chart via geom_point.

Is there a more straightforward way to change the quantiles without these undesired effects?

Example Code:

    library(ggplot2)

    set.seed(12)

    # Random data in B, grouped by values 1 to 4 in A
    u <- data.frame(A = sample.int(4, 100, replace = TRUE), B = rnorm(100))

    # Desired arguments
    qymax   <- 0.9
    qymin   <- 0.1
    qmiddle <- 0.5
    qupper  <- 0.8
    qlower  <- 0.2

Version 1: Repeated boxplots per value in A, grouped by A

    ggplot(u, aes(x = A, y = B)) +
      geom_boxplot(aes(group = A,
                       lower = quantile(B, qlower),
                       upper = quantile(B, qupper),
                       middle = quantile(B, qmiddle),
                       ymin = quantile(B, qymin),
                       ymax = quantile(B, qymax)),
                   stat = "identity")

Version 2: Compute the arguments first for each group. Base R solution

    # Split B by the values of A, then compute each quantile per group
    Bgrouped <- lapply(unique(u$A), function(a) u$B[u$A == a])
    .lower  <- sapply(Bgrouped, function(x) quantile(x, qlower))
    .upper  <- sapply(Bgrouped, function(x) quantile(x, qupper))
    .middle <- sapply(Bgrouped, function(x) quantile(x, qmiddle))
    .ymin   <- sapply(Bgrouped, function(x) quantile(x, qymin))
    .ymax   <- sapply(Bgrouped, function(x) quantile(x, qymax))

    # Note: this overwrites u with the per-group summary
    u <- data.frame(A = unique(u$A),
                    lower = .lower,
                    upper = .upper,
                    middle = .middle,
                    ymin = .ymin,
                    ymax = .ymax)

    ggplot(u, aes(x = A)) +
      geom_boxplot(aes(lower = lower, upper = upper, middle = middle,
                       ymin = ymin, ymax = ymax),
                   stat = "identity")
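Version 2 does not cover the third point (outliers). A rough base R sketch of that step follows. It rebuilds the raw data under the name raw, since Version 2 overwrites u, and it assumes outliers are defined as points more than 1.5 box heights beyond the custom lower/upper quantiles. That rule, the helper is_outlier and the names raw, flag and outliers are assumptions made for illustration, not part of the question.

```r
set.seed(12)
raw <- data.frame(A = sample.int(4, 100, replace = TRUE), B = rnorm(100))
qlower <- 0.2
qupper <- 0.8

# Hypothetical helper: flag values lying more than coef box heights
# outside the box, where the box is bounded by the custom quantiles.
is_outlier <- function(x, coef = 1.5) {
  box <- quantile(x, c(qlower, qupper), names = FALSE)
  height <- diff(box)
  x < box[1] - coef * height | x > box[2] + coef * height
}

# ave() applies the helper within each level of A and returns a vector
# aligned with the rows of raw (coerced to 0/1, hence as.logical()).
flag <- as.logical(ave(raw$B, raw$A, FUN = is_outlier))
outliers <- raw[flag, c("A", "B")]
outliers
```

The outliers data frame could then be added to the Version 2 plot with something like geom_point(data = outliers, aes(y = B)).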

Solution
It's not something I'd really do without a lot of justification, as people typically expect the boxplot's min / max / box values to correspond to the same quantile positions, but it can be done.

Data used (with extreme values added to demonstrate outliers):

    set.seed(12)
    u <- data.frame(A = sample.int(4, 100, replace = TRUE), B = rnorm(100))
    u$B[c(30, 70, 76)] <- c(4, -4, -5)

Solution 1: You can pre-compute the values without going the base R route, and include the outlier calculations in the same step. I'd do it completely within Hadley's tidyverse libraries, which I find neater:

    library(dplyr)
    library(tidyr)
    library(ggplot2)

    u %>%
      group_by(A) %>%
      summarise(lower = quantile(B, qlower),
                upper = quantile(B, qupper),
                middle = quantile(B, qmiddle),
                IQR = diff(c(lower, upper)),
                # whiskers: custom quantile, capped at 1.5 * IQR from the box
                ymin = max(quantile(B, qymin), lower - 1.5 * IQR),
                ymax = min(quantile(B, qymax), upper + 1.5 * IQR),
                # list-column holding each group's outlying values
                outliers = list(B[which(B > upper + 1.5 * IQR |
                                          B < lower - 1.5 * IQR)])) %>%
      ungroup() %>%
      ggplot(aes(x = A)) +
      geom_boxplot(aes(lower = lower, upper = upper, middle = middle,
                       ymin = ymin, ymax = ymax),
                   stat = "identity") +
      # keep only groups that have outliers, then expand to one row per outlier
      geom_point(data = . %>%
                   filter(sapply(outliers, length) > 0) %>%
                   select(A, outliers) %>%
                   unnest(cols = outliers),
                 aes(y = outliers))

Solution 2: You can override the actual quantile specifications used by ggplot. The calculations for geom_boxplot()'s quantiles are actually in StatBoxplot's compute_group() function, which begins as follows:

    compute_group = function(data, scales, width = NULL, na.rm = FALSE, coef = 1.5) {
      qs <- c(0, 0.25, 0.5, 0.75, 1)
      if (!is.null(data$weight)) {
        mod <- quantreg::rq(y ~ 1, weights = weight, data = data, tau = qs)
        stats <- as.numeric(stats::coef(mod))
      } else {
        stats <- as.numeric(stats::quantile(data$y, qs))
      }
      ... (omitted for space)

The qs vector defines the quantile positions. It's not affected by parameters passed to compute_group(), so the only way to change that is to change the definition for compute_group() itself:

    # save a copy of the original function, in case you need to revert
    original.function <- environment(ggplot2::StatBoxplot$compute_group)$f

    # define new function (only the first line for qs is changed, but you'll have to
    # copy & paste the whole thing)
    new.function <- function (data, scales, width = NULL, na.rm = FALSE, coef = 1.5) {
      qs <- c(0.1, 0.2, 0.5, 0.8, 0.9)
      if (!is.null(data$weight)) {
        mod <- quantreg::rq(y ~ 1, weights = weight, data = data, tau = qs)
        stats <- as.numeric(stats::coef(mod))
      } else {
        stats <- as.numeric(stats::quantile(data$y, qs))
      }
      names(stats) <- c("ymin", "lower", "middle", "upper", "ymax")
      iqr <- diff(stats[c(2, 4)])
      outliers <- data$y < (stats[2] - coef * iqr) | data$y > (stats[4] + coef * iqr)
      if (any(outliers)) {
        stats[c(1, 5)] <- range(c(stats[2:4], data$y[!outliers]), na.rm = TRUE)
      }
      if (length(unique(data$x)) > 1)
        width <- diff(range(data$x)) * 0.9
      df <- as.data.frame(as.list(stats))
      df$outliers <- list(data$y[outliers])
      if (is.null(data$weight)) {
        n <- sum(!is.na(data$y))
      } else {
        n <- sum(data$weight[!is.na(data$y) & !is.na(data$weight)])
      }
      df$notchupper <- df$middle + 1.58 * iqr/sqrt(n)
      df$notchlower <- df$middle - 1.58 * iqr/sqrt(n)
      df$x <- if (is.factor(data$x)) data$x[1] else mean(range(data$x))
      df$width <- width
      df$relvarwidth <- sqrt(n)
      df
    }

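To see what the changed qs line does in isolation, here is a small base R sketch that applies the unweighted branch (stats::quantile) to one vector with both sets of positions. The vector y and the names qs_default, qs_custom, stats_default and stats_custom are illustrative assumptions. Only the quantile step is reproduced, not the rest of compute_group().

```r
set.seed(12)
y <- rnorm(100)

qs_default <- c(0, 0.25, 0.5, 0.75, 1)     # positions in the stock function
qs_custom  <- c(0.1, 0.2, 0.5, 0.8, 0.9)   # positions in the modified copy

stat_names <- c("ymin", "lower", "middle", "upper", "ymax")

# Same call as in the unweighted branch, once per set of positions
stats_default <- setNames(as.numeric(stats::quantile(y, qs_default)), stat_names)
stats_custom  <- setNames(as.numeric(stats::quantile(y, qs_custom)), stat_names)

# Side-by-side comparison of the five box statistics
rbind(default = stats_default, custom = stats_custom)
```

Judging from the function body itself, the ymin and ymax values only survive when no outliers are detected. If any are found, stats[c(1, 5)] is replaced by the range of the non-outlying data.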
Result:

    # toggle between the two definitions
    environment(StatBoxplot$compute_group)$f <- original.function
    ggplot(u, aes(x = A, y = B, group = A)) +
      geom_boxplot() +
      ggtitle("original definition for calculated quantiles")

    environment(StatBoxplot$compute_group)$f <- new.function
    ggplot(u, aes(x = A, y = B, group = A)) +
      geom_boxplot() +
      ggtitle("new definition for calculated quantiles")

Do note that when you change the definition, it affects every ggplot object in your environment. So if you created a ggplot boxplot object before the definition change and print it afterwards, the boxplot will follow the new definition. (For the side-by-side comparison above, I had to convert each ggplot to a grob object immediately, in order to preserve the difference.)
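The grob conversion mentioned above might look roughly like the following sketch. It assumes ggplot2 is installed and uses ggplotGrob(), which builds the plot at the moment it is called, so the statistics are computed with whichever compute_group() definition is active then. The names p and g are illustrative.

```r
library(ggplot2)

set.seed(12)
u <- data.frame(A = sample.int(4, 100, replace = TRUE), B = rnorm(100))

p <- ggplot(u, aes(x = A, y = B, group = A)) +
  geom_boxplot()

# Build the plot now; the resulting grob stores the drawn shapes,
# so later changes to StatBoxplot do not alter it.
g <- ggplotGrob(p)

# Draw the frozen grob whenever needed
grid::grid.newpage()
grid::grid.draw(g)
```

Printing p later would rebuild it and pick up the current definition, whereas drawing g reproduces the plot as it was when the grob was created.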
