将阴影标准误差曲线添加到ggplot2中的geom_density [英] Add shaded standard error curves to geom_density in ggplot2

查看:274
本文介绍了将阴影标准误差曲线添加到ggplot2中的geom_density的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想使用ggplot2将阴影标准误差曲线添加到geom_density.我的代码如下:

I would like to add shaded standard error curves to geom_density using ggplot2. My code looks like this:

data.plot <- data.frame(x = c(rnorm(100, mean = 0, sd = 5),
                                    rnorm(100, mean = 1, sd =2 )), 
                             g = factor(c(rep(1, 100),
                                        rep(2,100))))

ggplot(data.plot, aes(x, linetype = g)) + geom_density()

我找不到要这样做的教程或示例.谢谢.

I couldn't find a tutorial or examples to do so. Thank you.

推荐答案

最好的解决方案是使用引导程序,如注释中所述.我将使用经典的 iris 数据,重点是每个 Species Sepal.Length 的密度.

The best solution is with bootstrapping, as mentioned in the comments. I'll use the classic iris data, focusing on the density of Sepal.Length for each Species.

library(dplyr)
library(broom)
library(tidyr)
library(ggplot2)


data_frame(bs = 1:1000) %>% group_by(bs) %>% 
  mutate(data = list(iris %>% group_by(Species) %>% 
                       sample_frac(size = 1, replace = T)))

# A tibble: 1,000 x 2
# Groups:   bs [1,000]
      bs               data
   <int>             <list>
 1     1 <tibble [150 x 5]>
 2     2 <tibble [150 x 5]>
 3     3 <tibble [150 x 5]>
 4     4 <tibble [150 x 5]>
 5     5 <tibble [150 x 5]>
 6     6 <tibble [150 x 5]>
 7     7 <tibble [150 x 5]>
 8     8 <tibble [150 x 5]>
 9     9 <tibble [150 x 5]>
10    10 <tibble [150 x 5]>
# ... with 990 more rows

因此,我仅对原始数据进行了1000次引导复制,并从每个组中取出了与其原始样本大小相同的行数,并进行了替换.现在,我必须unnest来访问嵌套的 data 列中的数据.

So I just made 1000 bootstrap replicates of the original data, taking the same number of rows out of each group as its original sample size, with replacement. Now I have to unnest to access the data in the nested data column.

densities.within <-
data_frame(bs = 1:1000) %>% group_by(bs) %>% 
  mutate(data = list(iris %>% group_by(Species) %>% 
                       sample_frac(size = 1, replace = T))) %>% 
  unnest() %>% 
  group_by(bs, Species) %>% 
  do(tidy(density(.$Sepal.Length, 
                  from = min(iris$Sepal.Length), 
                  to = max(iris$Sepal.Length), 
                  n = 128)))

# A tibble: 384,000 x 4
# Groups:   bs, Species [30,000]
      bs Species        x         y
   <int>  <fctr>    <dbl>     <dbl>
 1     1  setosa 4.300000 0.2395786
 2     1  setosa 4.328346 0.2821128
 3     1  setosa 4.356693 0.3235939
 4     1  setosa 4.385039 0.3632449
 5     1  setosa 4.413386 0.4010378
 6     1  setosa 4.441732 0.4375189
 7     1  setosa 4.470079 0.4734727
 8     1  setosa 4.498425 0.5095333
 9     1  setosa 4.526772 0.5459280
10     1  setosa 4.555118 0.5824587
# ... with 383,990 more rows

因此,我们将数据扩展为长格式,然后在 bs 中在 Species 中每个组的 Sepal.Length 的密度进行了计算.我们必须提供手册from =to =,因为每个引导程序中的最小值和最大值可能会有所不同(并设置比默认值512更低的n =只是为了节省时间). 为了简化生成的S3: density对象,我们使用broom::tidy.这是计算量大的步骤,因此我们将该对象另存为 densities.in .

So we expanded the data to its long form, then took the density of each group's Sepal.Length within Species within bs. We had to supply a manual from = and to = since the min and max within each bootstrap might differ (and set a lower n = than the default 512 just to save time). To simplify the S3: density object that's generated, we use broom::tidy. This is the computationally intensive step, so we'll save this object as densities.within.

这将产生名为 x y 的列,但是我们将重命名它们以匹配我们的数据.然后,我们将得出:对于在每个计算出的 Sepal.Length 中计算出的密度,CI的下限,中位数和CI的上限是多少?我们将使用quantile来获取这些特定的计算密度值.

This results in columns named x and y, but we'll rename these to match our data. Then we'll figure out: for the densities calculated at each computed possible Sepal.Length, what is the lower end of the CI, the median, and the upper end of the CI? We'll use quantile to get these specific values of the computed densities.

densities.qtiles <-
densities.within %>%
  rename(Sepal.Length = x, dens = y) %>%
  ungroup() %>%
  group_by(Species, Sepal.Length) %>% 
  summarise(q05 = quantile(dens, 0.025),
            q50 = quantile(dens, 0.5),
            q95 = quantile(dens, 0.975)) 

# A tibble: 384 x 5
# Groups:   Species [?]
   Species Sepal.Length        q05       q50       q95
    <fctr>        <dbl>      <dbl>     <dbl>     <dbl>
 1  setosa     4.300000 0.05730022 0.2355335 0.4426299
 2  setosa     4.328346 0.08177850 0.2734463 0.4970097
 3  setosa     4.356693 0.09863062 0.3114570 0.5505578
 4  setosa     4.385039 0.12459033 0.3430645 0.5884523
 5  setosa     4.413386 0.15049699 0.3705389 0.6207344
 6  setosa     4.441732 0.17494889 0.4006335 0.6418923
 7  setosa     4.470079 0.19836510 0.4258006 0.6655006
 8  setosa     4.498425 0.21106857 0.4555755 0.6971370
 9  setosa     4.526772 0.23399070 0.4813130 0.7244413
10  setosa     4.555118 0.24863090 0.5108057 0.7708114
# ... with 374 more rows

可视化和比较

ggplot(densities.qtiles, aes(Sepal.Length, q50)) +
  facet_wrap(~Species, nrow = 2) +
  geom_histogram(data = iris, 
                 aes(Sepal.Length, ..density..), 
                 colour = "black", fill = "white", 
                 binwidth = 0.25, boundary = 0) +
  geom_ribbon(aes(ymin = q05, ymax = q95), alpha = 0.5, fill = "grey50") +
  stat_density(data = iris, 
               aes(Sepal.Length, ..density.., color = "raw density"), 
               size = 2, geom = "line") +
  geom_line(size = 1.5, aes(color = "bootstrapped")) + 
  scale_color_manual(values = c("red", "black")) +
  labs(y = "density") +
  theme(legend.position = c(0.5,0),
        legend.justification = c(0,0))

我还为原始数据添加了直方图和密度层,以进行比较.您会看到1000个引导程序样本的中值密度和原始密度非常接近.

I included histogram and density layers for the original data as well for comparison. You can see that the median and raw densities are very close with 1000 bootstrap samples.

这篇关于将阴影标准误差曲线添加到ggplot2中的geom_density的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆