ggplot2 histogram with density curve that sums to 1
Problem description
Plotting a histogram with a density curve that sums to 1 for non-standardized data is ridiculously difficult. There are many questions already about this, but none of their solutions work for my data. There needs to be a simple solution that just works. I can't find an answer with a simple solution that works.
Some examples:
ggplot2: Overlay histogram with density curve (solution only works with standardized normal data)
ggplot2 density histogram with width=.5, vline and centered bar positions (with discrete data and no density curve)
Overlay density and histogram plot with ggplot2 using custom bins (no answer)
Creating a density histogram in ggplot2? (densities do not sum to 1 on my data)
ggplot2 density histogram with custom bin edges (does not sum to 1 on my data)
"Density" curve overlay on histogram where vertical axis is frequency (aka count) or relative frequency? (long explanation with examples, but density is not 1 with my data)
--
Some example code:
    # Example code
    library(ggplot2)
    set.seed(1)
    t = data.frame(r = runif(100))

    # first we try the obvious simple solution that should work
    ggplot(t, aes(r)) +
      geom_histogram() +
      geom_density()
So, clearly the density does not sum to 1.
    # maybe geom_histogram needs a ..density.. ?
    ggplot(t, aes(r)) +
      geom_histogram(aes(y = ..density..)) +
      geom_density()
It did change something, but not correctly.
    # maybe geom_density needs a ..density.. too?
    ggplot(t, aes(r)) +
      geom_histogram(aes(y = ..density..)) +
      geom_density(aes(y = ..density..))
No change there.
    # maybe binwidth = 1?
    ggplot(t, aes(r)) +
      geom_histogram(aes(y = ..density..), binwidth = 1) +
      geom_density(aes(y = ..density..))
Still wrong density curve, but now the histogram is wrong too.
To be sure, I did spend 4 hours trying all kinds of combinations of ..count.. and ..sum.. and ..density.., but since I can't find any documentation about how these are supposed to work, it's semi-blind trial and error.
So I gave up and avoided using ggplot2 to summarize the data.
So first we need to get the right proportions data.frame, and that wasn't so simple:
    get_prop_table = function(x, breaks_ = 20) {
      library(magrittr)
      library(plyr)
      library(stringr)  # provides str_extract_all()

      # proportion of observations in each of breaks_ equal-width bins
      x_prop_table = cut(x, breaks_) %>% table(.) %>% prop.table %>% data.frame
      colnames(x_prop_table) = c("interval", "density")

      # bin midpoint: mean of the two endpoints parsed from the interval label
      intervals = x_prop_table$interval %>% as.character
      fetch_numbers = str_extract_all(intervals, "\\d\\.\\d*")
      x_prop_table$means = laply(fetch_numbers, function(x) {
        x %>% as.numeric %>% mean
      })
      return(x_prop_table)
    }

    t_df = get_prop_table(t$r)
This gives the kind of summary data we want:
    > head(t_df)
              interval density    means
    1 (0.00859,0.0585]    0.06 0.033545
    2   (0.0585,0.107]    0.09 0.082750
    3    (0.107,0.156]    0.07 0.131500
    4    (0.156,0.205]    0.10 0.180500
    5    (0.205,0.254]    0.08 0.229500
    6    (0.254,0.303]    0.03 0.278500
Now we just have to plot it. Should be easy...
    ggplot(t_df, aes(means, density)) +
      geom_histogram(stat = "identity") +
      geom_density(stat = "identity")
Umm, not quite what I wanted. To be sure, I did try without stat = "identity" in geom_density, at which point it complained about not having a y.

    # let's try adding ..density.. then
    ggplot(t_df, aes(means, density)) +
      geom_histogram(stat = "identity") +
      geom_density(aes(y = ..density..))
Even more strange.
Okay, maybe let's give up on getting the density curve from summary data. Maybe we need to mix the approaches a bit...
    # adding together
    ggplot(t_df, aes(means, density)) +
      geom_bar(stat = "identity") +
      geom_density(data = t, aes(r, y = ..density..), stat = 'density')
Ok, at least the shape is right now. Now, we need to somehow scale it down.
    # let's try dividing by the number of bins
    ggplot(t_df, aes(means, density)) +
      geom_bar(stat = "identity") +
      geom_density(data = t, aes(r, y = ..density../20), stat = 'density')
Looks like we have a winner. Except that the number is hardcoded.
    # removing the hardcoding?
    divisor = nrow(t_df)
    ggplot(t_df, aes(means, density)) +
      geom_bar(stat = "identity") +
      geom_density(data = t, aes(r, y = ..density../divisor), stat = 'density')

    Error in eval(expr, envir, enclos) : object 'divisor' not found
Well, I almost expected it to work. Then I tried adding some ..'s here and there, as well as ..count.. and ..sum..; the first gave another wrong result and the second threw an error. I also tried using a multiplier (1/20) instead, with no luck.
    # salvation with get()
    divisor = nrow(t_df)
    ggplot(t_df, aes(means, density)) +
      geom_bar(stat = "identity") +
      geom_density(data = t, aes(r, y = ..density../get("divisor", pos = 1)), stat = 'density')
So, I finally got the right figure (I think; I hope).
Please tell me there is an easier way of doing this.
PS. The get() trick apparently does not work within a function. I would have put a working function here for future use, but that wasn't so easy either.

Solution

First, read Wickham on densities in R, noting the foibles and features of each package/function.
The densities sum to 1, but that doesn't mean the curve line/points will not go above 1.
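A quick way to see this, as a minimal sketch in base R (the interval [0, 0.5] is chosen purely for illustration and does not come from the question's data): a uniform distribution on an interval of width 0.5 has a constant density of 2, yet the area under that density is still 1.

```r
# Density of a uniform distribution on [0, 0.5]: constant height 1 / 0.5 = 2
height <- dunif(0.25, min = 0, max = 0.5)

# Area under the density over its support; min and max are passed on to dunif()
area <- integrate(dunif, lower = 0, upper = 0.5, min = 0, max = 0.5)$value

height  # 2
area    # 1, up to numerical tolerance
```

It is the area under the curve, not the height of the curve, that is constrained to equal 1.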
The following shows both this and the inaccuracy of (at least) the defaults of density when compared to, say, KernSmooth::bkde (using base plots for brevity of typing):

    library(KernSmooth)  # bkde()
    library(flux)        # auc()
    library(sfsmisc)     # integrate.xy()

    # uniform dist
    set.seed(1)
    dat <- runif(100)
    d1 <- density(dat)
    d1_ks <- bkde(dat)

    par(mfrow = c(2, 1))
    plot(d1)
    plot(d1_ks, type = "l")
    auc(d1$x, d1$y)
    ## [1] 1.000921
    integrate.xy(d1$x, d1$y)
    ## [1] 1.000921
    auc(d1_ks$x, d1_ks$y)
    ## [1] 1
    integrate.xy(d1_ks$x, d1_ks$y)
    ## [1] 1
Do the same for the beta distribution:
    # beta dist
    set.seed(1)
    dat <- rbeta(100, 0.5, 0.1)
    d2 <- density(dat)
    d2_ks <- bkde(dat)

    par(mfrow = c(2, 1))
    plot(d2)
    plot(d2_ks, type = "l")
    auc(d2$x, d2$y)
    ## [1] 1.000187
    integrate.xy(d2$x, d2$y)
    ## [1] 1.000188
    auc(d2_ks$x, d2_ks$y)
    ## [1] 1
    integrate.xy(d2_ks$x, d2_ks$y)
    ## [1] 1
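The same kind of area check can be done without extra packages. The following is a minimal sketch of the trapezoid rule in base R; the helper name trapz is hypothetical and is not part of flux or sfsmisc.

```r
# Hypothetical helper: trapezoid-rule area under the points (x, y)
trapz <- function(x, y) {
  # width of each interval times the average of its two end heights
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

set.seed(1)
dat <- runif(100)
d <- density(dat)  # kernel density estimate evaluated on a grid of points

area <- trapz(d$x, d$y)
area  # close to 1
```

The result is only approximately 1 because density() returns the estimate on a finite grid that is cut off a few bandwidths beyond the data.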
auc and integrate.xy both use the trapezoid rule, but I ran them both to show that and to show the results from two different functions.

The point is that the densities do in fact sum to 1, despite the y-axis values leading you to believe that they do not. I'm not sure what you are trying to solve with your manipulations.
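If the goal is bars whose heights are per-bin proportions (so that the bar heights, rather than the area, add up to 1), one possible approach is sketched below. It relies on the identity density = proportion / bin width, so multiplying the density curve by the bin width puts it on the same scale as the bars. That is also why dividing by 20 looked right in the question, where 20 bins covered a range of roughly 1. The sketch assumes ggplot2 3.3.0 or later, where after_stat(density) replaces the older ..density.. notation; the function name plot_prop_hist and its arguments are hypothetical.

```r
library(ggplot2)

# Hypothetical helper: histogram of per-bin proportions with a density
# curve rescaled onto the same y-axis.
plot_prop_hist <- function(data, var, bin_width) {
  ggplot(data, aes(x = .data[[var]])) +
    # bar height = share of observations falling in each bin
    geom_histogram(aes(y = after_stat(count / sum(count))),
                   binwidth = bin_width, boundary = 0) +
    # density * bin width approximates the expected share per bin
    geom_density(aes(y = after_stat(density * bin_width))) +
    labs(x = var, y = "proportion")
}

set.seed(1)
t <- data.frame(r = runif(100))
p <- plot_prop_hist(t, "r", bin_width = 0.05)
p
```

In recent ggplot2 versions aes() captures the environment it is called from, so bin_width should be found inside the function without a get() workaround.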