How to plot the difference between two ggplot density distributions?
Problem description
I would like to use ggplot2 to illustrate the difference between two similar density distributions. Here is a toy example of the type of data I have:
library(ggplot2)

# Make toy data
n_sp <- 100000
n_dup <- 50000
D <- data.frame(
  event = c(rep("sp", n_sp), rep("dup", n_dup)),
  q = c(rnorm(n_sp, mean = 2.0), rnorm(n_dup, mean = 2.1))
)

# Standard density plot
ggplot(D, aes(x = q, y = ..density.., col = event)) +
  geom_freqpoly()
Rather than plotting the density for each category (dup and sp) separately as above, how could I plot a single line that shows the difference between these distributions?
In the toy example above, if I subtracted the dup density distribution from the sp density distribution, the resulting line would be above zero on the left side of the plot (since there is an abundance of smaller sp values) and below zero on the right (since there is an abundance of larger dup values). Note that there may be a different number of observations of type dup and sp.
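As a quick check of that expectation, the theoretical densities behind the toy data can be subtracted directly. This is only a sketch: it assumes the populations are exactly N(2.0, 1) for sp and N(2.1, 1) for dup, as in the rnorm calls above, and the grid x and the name diff_true are illustrative choices. With equal variances the two curves cross midway between the means, at x = 2.05.

```r
# Theoretical densities for the toy example: sp ~ N(2.0, 1), dup ~ N(2.1, 1)
x <- seq(-2, 6, by = 0.01)                       # illustrative evaluation grid
diff_true <- dnorm(x, mean = 2.0) - dnorm(x, mean = 2.1)

# sp is denser to the left of the crossing point, dup to the right
left_positive  <- all(diff_true[x < 2.04] > 0)
right_negative <- all(diff_true[x > 2.06] < 0)

# Plot the difference with a reference line at zero
plot(x, diff_true, type = "l", ylab = "sp density - dup density")
abline(h = 0, lty = 2)
```

Because density curves are normalised, the difference does not depend on how many observations each group has.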
More generally, what is the best way to show differences between similar density distributions?
Recommended answer
There may be a way to do this within ggplot, but it is often easiest to do the calculations beforehand. In this case, call density on each subset of q over the same range, then subtract the y values. Using dplyr (translate to base R or data.table if you wish):
library(dplyr)
library(ggplot2)

D %>% group_by(event) %>%
  # calculate densities for each group over same range; store in list column
  summarise(d = list(density(q, from = min(.$q), to = max(.$q)))) %>%
  # make a new data.frame from two density objects
  # (groups are sorted alphabetically, so d[[1]] is dup and d[[2]] is sp)
  do(data.frame(x = .$d[[1]]$x,                    # grab one set of x values (which are the same)
                y = .$d[[1]]$y - .$d[[2]]$y)) %>%  # and subtract the y values (dup minus sp)
  ggplot(aes(x, y)) +    # now plot
  geom_line()
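The answer notes that the same idea can be translated to base R. A minimal sketch of that translation follows. The fixed seed, the smaller sample sizes, the grid size n = 512, the dashed zero line, and the names rng, d_sp, d_dup and diff_df are assumptions made for illustration and are not part of the original answer. Here the subtraction is written as sp minus dup, which is the direction described in the question.

```r
library(ggplot2)

# Toy data as in the question (seed and smaller n are assumptions, for speed and reproducibility)
set.seed(1)
n_sp <- 10000
n_dup <- 5000
D <- data.frame(
  event = c(rep("sp", n_sp), rep("dup", n_dup)),
  q = c(rnorm(n_sp, mean = 2.0), rnorm(n_dup, mean = 2.1))
)

# Evaluate both kernel density estimates on one shared grid
rng <- range(D$q)
d_sp  <- density(D$q[D$event == "sp"],  from = rng[1], to = rng[2], n = 512)
d_dup <- density(D$q[D$event == "dup"], from = rng[1], to = rng[2], n = 512)

# The x grids are identical, so the y values can be subtracted point by point
diff_df <- data.frame(x = d_sp$x, y = d_sp$y - d_dup$y)

# Plot the difference with a reference line at zero
p <- ggplot(diff_df, aes(x, y)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_line()
print(p)
```

Each call to density picks its own default bandwidth, so the two estimates are smoothed slightly differently; passing a common bw to both calls is one option if that matters for the comparison.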