ggplot2 histogram with density curve that sums to 1


Question


Plotting a histogram with a density curve that sums to 1 for non-standardized data is ridiculously difficult. There are many questions already about this, but none of their solutions work for my data. There needs to be a simple solution that just works. I can't find an answer with a simple solution that works.

Some examples:

solution only works with standardized normal data: "ggplot2: Overlay histogram with density curve"

with discrete data and no density curve: "ggplot2 density histogram with width=.5, vline and centered bar positions"

no answer: "Overlay density and histogram plot with ggplot2 using custom bins"

densities do not sum to 1 on my data: "Creating a density histogram in ggplot2?"

does not sum to 1 on my data: "ggplot2 density histogram with custom bin edges"

long explanation here with examples, but density is not 1 with my data: ""Density" curve overlay on histogram where vertical axis is frequency (aka count) or relative frequency?"

--

Some example code:

#Example code
library(ggplot2)
set.seed(1)
t = data.frame(r = runif(100))

#first we try the obvious simple solution that should work
ggplot(t, aes(r)) + 
  geom_histogram() + 
  geom_density()

So, clearly the density does not sum to 1.

#maybe geom_histogram needs a ..density.. ?
ggplot(t, aes(r)) + 
  geom_histogram(aes(y = ..density..)) + 
  geom_density()

It did change something, but not correctly.

#maybe geom_density needs a ..density.. too ?
ggplot(t, aes(r)) + 
  geom_histogram(aes(y = ..density..)) + 
  geom_density(aes(y = ..density..))

No change there.

#maybe binwidth = 1?
ggplot(t, aes(r)) + 
  geom_histogram(aes(y = ..density..), binwidth=1) + 
  geom_density(aes(y = ..density..))

Still wrong density curve, but now the histogram is wrong too.

To be sure, I did spend 4 hours trying all kinds of combinations of ..count.. and ..sum.. and ..density.., but since I can't find any documentation about how these are supposed to work, it's semi-blind trial and error.
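For reference, `count` and `density` are variables computed by the stat behind each layer, and current ggplot2 releases write them as `after_stat(count)` rather than `..count..`. The sketch below assumes ggplot2 3.3.0 or later (where `after_stat()` is available) and an illustrative bin width of 0.05, neither of which comes from the original post. It draws the bars as the share of observations per bin, so they sum to 1, and rescales the density curve by the same bin width so both layers share one scale.

```r
library(ggplot2)

set.seed(1)
t <- data.frame(r = runif(100))

p <- ggplot(t, aes(r)) +
  # bar height = share of observations in each bin, so the bars sum to 1
  geom_histogram(aes(y = after_stat(count / sum(count))),
                 binwidth = 0.05, boundary = 0) +
  # density * bin width = approximate share of observations per bin
  geom_density(aes(y = after_stat(density * 0.05))) +
  ylab("proportion")

print(p)
```

The curve itself still integrates to 1 before rescaling; multiplying by the bin width only converts "density per unit of r" into "share per bin".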

So I gave up and avoided using ggplot2 to summarize the data.

So first we need to get the right proportions data.frame, and that wasn't so simple:

get_prop_table = function(x, breaks_=20){
  library(magrittr)
  library(plyr)
  library(stringr)  # provides str_extract_all
  # bin x, then convert the bin counts to proportions
  x_prop_table = cut(x, breaks_) %>% table(.) %>% prop.table %>% data.frame
  colnames(x_prop_table) = c("interval", "density")
  intervals = x_prop_table$interval %>% as.character
  # pull the two numeric endpoints out of each interval label
  fetch_numbers = str_extract_all(intervals, "\\d\\.\\d*")
  x_prop_table$means = laply(fetch_numbers, function(x) {
    x %>% as.numeric %>% mean
  })
  return(x_prop_table)
}

t_df = get_prop_table(t$r)

This gives the kind of summary data we want:

> head(t_df)
          interval density    means
1 (0.00859,0.0585]    0.06 0.033545
2   (0.0585,0.107]    0.09 0.082750
3    (0.107,0.156]    0.07 0.131500
4    (0.156,0.205]    0.10 0.180500
5    (0.205,0.254]    0.08 0.229500
6    (0.254,0.303]    0.03 0.278500
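A similar table can probably be built with less machinery using base R's `hist()` with `plot = FALSE`, which returns bin midpoints and counts directly. This is a minimal sketch, not the code from the post: the fixed breaks from 0 to 1 in steps of 0.05 and the names `prop_table`, `mid` and `proportion` are assumptions for illustration, and the bins differ slightly from those chosen by `cut(x, 20)`, which span the observed data range.

```r
set.seed(1)
x <- runif(100)

# hist() does the binning; plot = FALSE returns the bin data without drawing
h <- hist(x, breaks = seq(0, 1, by = 0.05), plot = FALSE)

prop_table <- data.frame(
  mid        = h$mids,                    # bin midpoints
  proportion = h$counts / sum(h$counts)   # share of observations per bin
)
head(prop_table)
```

This avoids parsing interval labels with a regular expression, which would break for negative values or labels printed in scientific notation.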

Now we just have to plot it. Should be easy...

ggplot(t_df, aes(means, density)) + 
  geom_histogram(stat = "identity") +
  geom_density(stat = "identity")

Umm, not quite what I wanted. To be sure, I did try without stat = "identity" in geom_density, at which point it complained about not having a y.

#lets try adding ..density.. then
ggplot(t_df, aes(means, density)) + 
  geom_histogram(stat = "identity") +
  geom_density(aes(y = ..density..))

Even more strange.

Okay, maybe let's give up on getting the density curve from summary data. Maybe we need to mix the approaches a bit...

#adding together
ggplot(t_df, aes(means, density)) +
  geom_bar(stat = "identity") +
  geom_density(data=t, aes(r, y = ..density..), stat = 'density')

Ok, at least the shape is right now. Now, we need to somehow scale it down.

#lets try dividing by the number of bins
ggplot(t_df, aes(means, density)) +
  geom_bar(stat = "identity") +
  geom_density(data=t, aes(r, y = ..density../20), stat = 'density')

Looks like we have a winner. Except that the number is hardcoded.
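A likely reason dividing by 20 works here: a density value times a bin width approximates the share of observations in that bin, and with data spread over roughly [0, 1] and 20 equal bins the bin width is close to 1/20. The sketch below checks that arithmetic under the assumption that the bins are equal-width over the data range (approximately what `cut(x, 20)` produces); `bin_width` and `approx_prop` are illustrative names.

```r
set.seed(1)
x <- runif(100)

# 20 equal-width bins across the observed range
bin_width <- diff(range(x)) / 20
1 / bin_width   # close to 20 for data spread over roughly [0, 1]

d <- density(x)
approx_prop <- d$y * bin_width   # density rescaled to "share per bin"
```

For data on another scale the divisor would not equal the number of bins, so multiplying by the bin width is the more general form.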

#removing the hardcoding?
divisor = nrow(t_df)
ggplot(t_df, aes(means, density)) +
  geom_bar(stat = "identity") +
  geom_density(data=t, aes(r, y = ..density../divisor), stat = 'density')

Error in eval(expr, envir, enclos) : object 'divisor' not found

Well, I almost expected it to work. Now I tried adding some ..'s here and there, also ..count.. and ..sum.., the first of which gave another wrong result and the second of which threw an error. I also tried using a multiplier (1/20), with no luck.

#salvation with get()
divisor = nrow(t_df)
ggplot(t_df, aes(means, density)) +
  geom_bar(stat = "identity") +
  geom_density(data=t, aes(r, y = ..density../get("divisor", pos = 1)), stat = 'density')

So, I finally got the right figure (I think; I hope).

Please tell me there is an easier way of doing this.

PS. The get() trick apparently does not work within a function. I would have put a working function here for future use, but that wasn't so easy either.
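In ggplot2 3.0.0 and later, `aes()` captures the environment it is written in, so a variable local to a function should be usable inside the aesthetic without `get()`. The following is a sketch under that assumption (plus `after_stat()` from ggplot2 3.3.0 or later); `plot_prop_hist` is a hypothetical helper, not something from the original post or from ggplot2.

```r
library(ggplot2)

# Hypothetical helper: histogram of proportions with a density curve
# drawn on the same scale.
plot_prop_hist <- function(x, bins = 20) {
  bw <- diff(range(x)) / bins   # local variable referenced inside aes()
  ggplot(data.frame(x = x), aes(x)) +
    geom_histogram(aes(y = after_stat(count / sum(count))),
                   binwidth = bw, boundary = min(x)) +
    geom_density(aes(y = after_stat(density * bw)))
}

set.seed(1)
p <- plot_prop_hist(runif(100))
```

Calling `print(p)` draws the plot; the bin width is derived from the data, so nothing is hardcoded.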

Solution

First, read Wickham on densities in R, noting the foibles and features of each package/function.

The densities sum to 1, but that doesn't mean the curve line/points will not go above 1.

The following shows both this and the inaccuracy of (at least) the defaults of density when compared to, say, KernSmooth::bkde (using base plots for brevity of typing):

library(KernSmooth)
library(flux)
library(sfsmisc)

# uniform dist
set.seed(1)
dat <- runif(100)

d1 <- density(dat)
d1_ks <- bkde(dat)

par(mfrow=c(2,1))
plot(d1)
plot(d1_ks, type="l")

auc(d1$x, d1$y)
## [1] 1.000921

integrate.xy(d1$x, d1$y)
## [1] 1.000921

auc(d1_ks$x, d1_ks$y)
## [1] 1

integrate.xy(d1_ks$x, d1_ks$y)
## [1] 1

Do the same for the beta distribution:

# beta dist
set.seed(1)
dat <- rbeta(100, 0.5, 0.1)

d2 <- density(dat)
d2_ks <- bkde(dat)

par(mfrow=c(2,1))
plot(d2)
plot(d2_ks, type="l")

auc(d2$x, d2$y)
## [1] 1.000187

integrate.xy(d2$x, d2$y)
## [1] 1.000188

auc(d2_ks$x, d2_ks$y)
## [1] 1

integrate.xy(d2_ks$x, d2_ks$y)
## [1] 1

auc and integrate.xy both use the trapezoid rule but I ran them to both show that and to show the results from two different functions.

The point is that the densities do in fact sum to 1, despite the y-axis values leading you to believe that they do not. I'm not sure what you are trying to solve with your manipulations.
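The same check can be done without extra packages. This minimal sketch uses only base R and data drawn on [0, 0.5], an interval chosen for illustration so that the curve has to rise above 1 while its area stays at about 1; the hand-written trapezoid sum stands in for `auc` and is not taken from either package.

```r
set.seed(1)
x <- runif(100, min = 0, max = 0.5)   # narrow range, so the density must be tall
d <- density(x)

# trapezoid rule: step width times the average of neighbouring heights
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

max(d$y)   # above 1
area       # close to 1
```

The height of a density is measured per unit of x, so it is the area under the curve, not the y-axis values, that is constrained to 1.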
