How to mimic geom_boxplot() with outliers using geom_boxplot(stat = "identity")

Problem description

I would like to pre-compute by-variable summaries of data (with plyr and passing a quantile function) and then plot with geom_boxplot(stat = "identity"). This works great except it (a) does not plot outliers as points and (b) extends the "whiskers" to the max and min of the data being plotted.

Example:

library(plyr)
library(ggplot2)

set.seed(4)
df <- data.frame(fact = sample(letters[1:2], 12, replace = TRUE),
                 val  = c(1:10, 100, 101))
df
#    fact val
# 1     b   1
# 2     a   2
# 3     a   3
# 4     a   4
# 5     b   5
# 6     a   6
# 7     b   7
# 8     b   8
# 9     b   9
# 10    a  10
# 11    b 100
# 12    a 101

by.fact.df <- ddply(df, c("fact"), function(x) quantile(x$val))

by.fact.df
#   fact 0%  25% 50%  75% 100%
# 1    a  2 3.25 5.0 9.00  101
# 2    b  1 5.50 7.5 8.75  100

# What I can do...with faults (a) and (b) above
ggplot(by.fact.df, 
       aes(x = fact, ymin = `0%`, lower = `25%`, middle = `50%`, 
           upper = `75%`,  ymax = `100%`)) +
  geom_boxplot(stat = "identity")

# What I want...
ggplot(df, aes(x = fact, y = val)) +
  geom_boxplot()

What I can do...with faults (a) and (b) mentioned above:

What I would like to obtain, but still leverage pre-computation via plyr (or other method):

Initial Thoughts: Perhaps there is some way to pre-compute the true end-points of the whiskers without the outliers? Then, subset the data for outliers and pass them as geom_point()?

Motivation: When working with larger datasets, I have found it faster and more practical to leverage plyr, dplyr, and/or data.table to pre-compute the stats and then plot them rather than having ggplot2 do the calculations.

I am able to extract what I need with the following mix of dplyr and plyr code, but I'm not sure if this is the most efficient way:

library(dplyr)  # provides %>%, group_by() and do(); load after plyr

df %>%
  group_by(fact) %>%
  do(ldply(boxplot.stats(.$val), data.frame))

Source: local data frame [6 x 3]
Groups: fact

  fact   .id X..i..
1    a stats      2
2    a stats      4
3    a stats     10
4    a stats     13
5    a stats     16
6    a     n      9

Recommended answer

Here's my answer, using built-in functions quantile and boxplot.stats.

geom_boxplot does the calculations for the boxplot slightly differently than boxplot.stats. Read ?geom_boxplot and ?boxplot.stats to understand my implementation below.
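As a small illustration of that difference, here is a sketch on a made-up vector (`x` is not from the question). It assumes, as described in ?geom_boxplot and ?boxplot.stats, that ggplot2 takes the box edges from the first and third quartiles while boxplot.stats takes them from the hinges returned by fivenum.

```r
x <- 1:8

# Hinges as used by boxplot.stats(): elements 2 and 4 of fivenum()
hinges <- fivenum(x)[c(2, 4)]

# Quartiles as described in ?geom_boxplot: quantile() with its default type
quartiles <- unname(quantile(x, c(0.25, 0.75)))

hinges     # 2.50 6.50
quartiles  # 2.75 6.25
```

Because the whisker limits are measured from these box edges, the two methods can also disagree on which points count as outliers, mostly with small samples.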

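# Function to calculate boxplot stats that approximate ggplot2's geom_boxplot().
# Note: boxplot.stats() uses fivenum() hinges, so whiskers and outliers can
# differ slightly from ggplot2's quantile-based rule for some sample sizes.
my_boxplot.stats <- function(x) {
  quantiles <- quantile(x, c(0, 0.25, 0.5, 0.75, 1))
  labels <- names(quantiles)
  bp <- boxplot.stats(x)
  # Replace both whisker ends with the most extreme non-outlying values
  quantiles[c(1, 5)] <- bp$stats[c(1, 5)]
  res <- data.frame(rbind(quantiles))
  names(res) <- labels
  # One row per outlier (NA if there are none) so that any number of
  # outliers per group works; the box statistics are repeated on each row
  out <- if (length(bp$out) > 0) bp$out else NA_real_
  res <- res[rep(1, length(out)), , drop = FALSE]
  res$out <- out
  return(res)
}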

Code to calculate the stats and plot it

library(dplyr)
library(ggplot2)

df %>%
  group_by(fact) %>%
  do(my_boxplot.stats(.$val)) %>%
  ggplot(aes(x = fact, y = out, ymin = `0%`, lower = `25%`, middle = `50%`,
             upper = `75%`, ymax = `100%`)) +
  geom_boxplot(stat = "identity") +
  geom_point(na.rm = TRUE)  # groups without outliers carry out = NA
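The idea from the question (pre-compute the whisker end-points without the outliers, then pass the outliers to geom_point()) can also be sketched with the box statistics and the outliers kept in two separate data frames. This is only a sketch under stated assumptions: `gg_box_stats`, `stats_df` and `out_df` are hypothetical names, the data frame is typed in from the table printed in the question, and the 1.5 * IQR rule follows the description in ?geom_boxplot rather than ggplot2's internal code.

```r
library(ggplot2)

# Data copied from the printed table in the question
df <- data.frame(
  fact = c("b", "a", "a", "a", "b", "a", "b", "b", "b", "a", "b", "a"),
  val  = c(1:10, 100, 101)
)

# Hypothetical helper: box statistics following the rule described in
# ?geom_boxplot (quartile hinges, whiskers within coef * IQR of the hinges)
gg_box_stats <- function(x, coef = 1.5) {
  qs <- unname(quantile(x, c(0.25, 0.5, 0.75)))
  iqr <- qs[3] - qs[1]
  is_out <- x < qs[1] - coef * iqr | x > qs[3] + coef * iqr
  list(
    stats = data.frame(ymin = min(x[!is_out]), lower = qs[1], middle = qs[2],
                       upper = qs[3], ymax = max(x[!is_out])),
    out = x[is_out]
  )
}

# Pre-compute per group, keeping box statistics and outliers separate
by_group <- lapply(split(df$val, df$fact), gg_box_stats)
stats_df <- do.call(rbind, lapply(names(by_group), function(g) {
  cbind(fact = g, by_group[[g]]$stats)
}))
out_df <- do.call(rbind, lapply(names(by_group), function(g) {
  out <- by_group[[g]]$out
  data.frame(fact = rep(g, length(out)), val = out)
}))

# Boxes come from the summary table, points from the outlier table
p <- ggplot(stats_df, aes(x = fact)) +
  geom_boxplot(aes(ymin = ymin, lower = lower, middle = middle,
                   upper = upper, ymax = ymax), stat = "identity") +
  geom_point(data = out_df, aes(y = val))
p
```

For the sample data this gives whiskers from 2 to 10 for group a and from 1 to 9 for group b, with 101 and 100 drawn as outlier points. Keeping the outliers in their own data frame means a group can have zero, one, or many of them without repeating the box statistics.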
