将阴影标准误差曲线添加到ggplot2中的geom_density [英] Add shaded standard error curves to geom_density in ggplot2
问题描述
我想使用ggplot2
将阴影标准误差曲线添加到geom_density
.我的代码如下:
I would like to add shaded standard error curves to geom_density
using ggplot2
. My code looks like this:
data.plot <- data.frame(x = c(rnorm(100, mean = 0, sd = 5),
rnorm(100, mean = 1, sd =2 )),
g = factor(c(rep(1, 100),
rep(2,100))))
ggplot(data.plot, aes(x, linetype = g)) + geom_density()
我找不到要这样做的教程或示例.谢谢.
I couldn't find a tutorial or examples to do so. Thank you.
推荐答案
最好的解决方案是使用引导程序,如注释中所述.我将使用经典的 iris 数据,重点是每个 Species 的 Sepal.Length 的密度.
The best solution is with bootstrapping, as mentioned in the comments. I'll use the classic iris data, focusing on the density of Sepal.Length for each Species.
library(dplyr)
library(broom)
library(tidyr)
library(ggplot2)
data_frame(bs = 1:1000) %>% group_by(bs) %>%
mutate(data = list(iris %>% group_by(Species) %>%
sample_frac(size = 1, replace = T)))
# A tibble: 1,000 x 2
# Groups: bs [1,000]
bs data
<int> <list>
1 1 <tibble [150 x 5]>
2 2 <tibble [150 x 5]>
3 3 <tibble [150 x 5]>
4 4 <tibble [150 x 5]>
5 5 <tibble [150 x 5]>
6 6 <tibble [150 x 5]>
7 7 <tibble [150 x 5]>
8 8 <tibble [150 x 5]>
9 9 <tibble [150 x 5]>
10 10 <tibble [150 x 5]>
# ... with 990 more rows
因此,我仅对原始数据进行了1000次引导复制,并从每个组中取出了与其原始样本大小相同的行数,并进行了替换.现在,我必须unnest
来访问嵌套的 data 列中的数据.
So I just made 1000 bootstrap replicates of the original data, taking the same number of rows out of each group as its original sample size, with replacement. Now I have to unnest
to access the data in the nested data column.
densities.within <-
data_frame(bs = 1:1000) %>% group_by(bs) %>%
mutate(data = list(iris %>% group_by(Species) %>%
sample_frac(size = 1, replace = T))) %>%
unnest() %>%
group_by(bs, Species) %>%
do(tidy(density(.$Sepal.Length,
from = min(iris$Sepal.Length),
to = max(iris$Sepal.Length),
n = 128)))
# A tibble: 384,000 x 4
# Groups: bs, Species [30,000]
bs Species x y
<int> <fctr> <dbl> <dbl>
1 1 setosa 4.300000 0.2395786
2 1 setosa 4.328346 0.2821128
3 1 setosa 4.356693 0.3235939
4 1 setosa 4.385039 0.3632449
5 1 setosa 4.413386 0.4010378
6 1 setosa 4.441732 0.4375189
7 1 setosa 4.470079 0.4734727
8 1 setosa 4.498425 0.5095333
9 1 setosa 4.526772 0.5459280
10 1 setosa 4.555118 0.5824587
# ... with 383,990 more rows
因此,我们将数据扩展为长格式,然后在 bs 中在 Species 中每个组的 Sepal.Length 的密度进行了计算.我们必须提供手册from =
和to =
,因为每个引导程序中的最小值和最大值可能会有所不同(并设置比默认值512更低的n =
只是为了节省时间).
为了简化生成的S3: density
对象,我们使用broom::tidy
.这是计算量大的步骤,因此我们将该对象另存为 densities.in .
So we expanded the data to its long form, then took the density of each group's Sepal.Length within Species within bs. We had to supply a manual from =
and to =
since the min and max within each bootstrap might differ (and set a lower n =
than the default 512 just to save time).
To simplify the S3: density
object that's generated, we use broom::tidy
. This is the computationally intensive step, so we'll save this object as densities.within.
这将产生名为 x 和 y 的列,但是我们将重命名它们以匹配我们的数据.然后,我们将得出:对于在每个计算出的 Sepal.Length 中计算出的密度,CI的下限,中位数和CI的上限是多少?我们将使用quantile
来获取这些特定的计算密度值.
This results in columns named x and y, but we'll rename these to match our data. Then we'll figure out: for the densities calculated at each computed possible Sepal.Length, what is the lower end of the CI, the median, and the upper end of the CI? We'll use quantile
to get these specific values of the computed densities.
densities.qtiles <-
densities.within %>%
rename(Sepal.Length = x, dens = y) %>%
ungroup() %>%
group_by(Species, Sepal.Length) %>%
summarise(q05 = quantile(dens, 0.025),
q50 = quantile(dens, 0.5),
q95 = quantile(dens, 0.975))
# A tibble: 384 x 5
# Groups: Species [?]
Species Sepal.Length q05 q50 q95
<fctr> <dbl> <dbl> <dbl> <dbl>
1 setosa 4.300000 0.05730022 0.2355335 0.4426299
2 setosa 4.328346 0.08177850 0.2734463 0.4970097
3 setosa 4.356693 0.09863062 0.3114570 0.5505578
4 setosa 4.385039 0.12459033 0.3430645 0.5884523
5 setosa 4.413386 0.15049699 0.3705389 0.6207344
6 setosa 4.441732 0.17494889 0.4006335 0.6418923
7 setosa 4.470079 0.19836510 0.4258006 0.6655006
8 setosa 4.498425 0.21106857 0.4555755 0.6971370
9 setosa 4.526772 0.23399070 0.4813130 0.7244413
10 setosa 4.555118 0.24863090 0.5108057 0.7708114
# ... with 374 more rows
可视化和比较
ggplot(densities.qtiles, aes(Sepal.Length, q50)) +
facet_wrap(~Species, nrow = 2) +
geom_histogram(data = iris,
aes(Sepal.Length, ..density..),
colour = "black", fill = "white",
binwidth = 0.25, boundary = 0) +
geom_ribbon(aes(ymin = q05, ymax = q95), alpha = 0.5, fill = "grey50") +
stat_density(data = iris,
aes(Sepal.Length, ..density.., color = "raw density"),
size = 2, geom = "line") +
geom_line(size = 1.5, aes(color = "bootstrapped")) +
scale_color_manual(values = c("red", "black")) +
labs(y = "density") +
theme(legend.position = c(0.5,0),
legend.justification = c(0,0))
我还为原始数据添加了直方图和密度层,以进行比较.您会看到1000个引导程序样本的中值密度和原始密度非常接近.
I included histogram and density layers for the original data as well for comparison. You can see that the median and raw densities are very close with 1000 bootstrap samples.
这篇关于将阴影标准误差曲线添加到ggplot2中的geom_density的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!