ggplot2 histogram with density curve that sums to 1

Views: 472 | Published: 2018/4/24 21:08:20 | Tags: r, ggplot2, histogram


Problem description

Plotting a histogram with a density curve that sums to 1 for non-standardized data is ridiculously difficult. There are many questions already about this, but none of their solutions work for my data. There needs to be a simple solution that just works. I can't find an answer with a simple solution that works.

Some examples:

solution only works with standardized normal data
ggplot2: Overlay histogram with density curve

with discrete data and no density curve
ggplot2 density histogram with width=.5, vline and centered bar positions

no answer
Overlay density and histogram plot with ggplot2 using custom bins

densities do not sum to 1 on my data
Creating a density histogram in ggplot2?

does not sum to 1 on my data
ggplot2 density histogram with custom bin edges

long explanation here with examples, but density is not 1 with my data
"Density" curve overlay on histogram where vertical axis is frequency (aka count) or relative frequency?

--

Some example code:
#Example code
library(ggplot2)
set.seed(1)
t = data.frame(r = runif(100))

#first we try the obvious simple solution that should work
ggplot(t, aes(r)) + 
  geom_histogram() + 
  geom_density()


So, clearly the density does not sum to 1.
#maybe geom_histogram needs a ..density.. ?
ggplot(t, aes(r)) + 
  geom_histogram(aes(y = ..density..)) + 
  geom_density()


It did change something, but not correctly.
#maybe geom_density needs a ..density.. too ?
ggplot(t, aes(r)) + 
  geom_histogram(aes(y = ..density..)) + 
  geom_density(aes(y = ..density..))
No change there.
#maybe binwidth = 1?
ggplot(t, aes(r)) + 
  geom_histogram(aes(y = ..density..), binwidth=1) + 
  geom_density(aes(y = ..density..))


Still wrong density curve, but now the histogram is wrong too.

To be sure, I did spend 4 hours trying all kinds of combinations of ..count.. and ..sum.. and ..density.., but since I can't find any documentation about how these are supposed to work, it's semi-blind trial and error.
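For reference, here is a minimal base-R sketch of what those computed variables appear to represent. It assumes that ..count.. is the number of observations in each bin and that ..density.. is the count divided by (n × bin width), which is how base R's hist() defines its density. The variable names are illustrative only.

```r
# Base-R sketch of the quantities behind ..count.. and ..density..
set.seed(1)
x <- runif(100)

h <- hist(x, breaks = 20, plot = FALSE)  # bin the data without drawing anything
binwidth <- diff(h$breaks)               # width of each bin

count      <- h$counts                         # observations per bin
proportion <- count / sum(count)               # bar heights sum to 1
dens       <- count / (sum(count) * binwidth)  # bar areas sum to 1

sum(proportion)             # heights add up to 1
sum(dens * binwidth)        # areas add up to 1
all.equal(dens, h$density)  # same values hist() reports as density
```

So a "density" histogram and a "proportion" histogram differ only by a factor of the bin width, which is why the two y-axes look so different for narrow bins.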

So I gave up and avoided using ggplot2 to summarize the data.

So first we need to get the right proportions data.frame, and that wasn't so simple:
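get_prop_table = function(x, breaks_=20){
  library(magrittr)
  library(plyr)
  library(stringr)  # provides str_extract_all
  # bin x into breaks_ equal-width intervals and convert counts to proportions
  x_prop_table = cut(x, breaks_) %>% table(.) %>% prop.table %>% data.frame
  colnames(x_prop_table) = c("interval", "density")
  intervals = x_prop_table$interval %>% as.character
  # pull both endpoints out of labels such as "(0.0585,0.107]"
  fetch_numbers = str_extract_all(intervals, "\\d\\.\\d*")
  # midpoint of each interval
  x_prop_table$means = laply(fetch_numbers, function(x) {
    x %>% as.numeric %>% mean
  })
  return(x_prop_table)
}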

t_df = get_prop_table(t$r)
This gives the kind of summary data we want:
> head(t_df)
          interval density    means
1 (0.00859,0.0585]    0.06 0.033545
2   (0.0585,0.107]    0.09 0.082750
3    (0.107,0.156]    0.07 0.131500
4    (0.156,0.205]    0.10 0.180500
5    (0.205,0.254]    0.08 0.229500
6    (0.254,0.303]    0.03 0.278500
Now we just have to plot it. Should be easy...
ggplot(t_df, aes(means, density)) + 
  geom_histogram(stat = "identity") +
  geom_density(stat = "identity")


Umm, not quite what I wanted. To be sure, I did try without stat = "identity" in geom_density, at which point it complained about not having a y.
#lets try adding ..density.. then
ggplot(t_df, aes(means, density)) + 
  geom_histogram(stat = "identity") +
  geom_density(aes(y = ..density..))


Even more strange.

Okay, maybe let's give up on getting the density curve from summary data. Maybe we need to mix the approaches a bit...
#adding together
ggplot(t_df, aes(means, density)) +
  geom_bar(stat = "identity") +
  geom_density(data=t, aes(r, y = ..density..), stat = 'density')


Ok, at least the shape is right now. Now, we need to somehow scale it down.
#lets try dividing by the number of bins
ggplot(t_df, aes(means, density)) +
  geom_bar(stat = "identity") +
  geom_density(data=t, aes(r, y = ..density../20), stat = 'density')


Looks like we have a winner. Except that the number is hardcoded.
#removing the hardcoding?
divisor = nrow(t_df)
ggplot(t_df, aes(means, density)) +
  geom_bar(stat = "identity") +
  geom_density(data=t, aes(r, y = ..density../divisor), stat = 'density')

Error in eval(expr, envir, enclos) : object 'divisor' not found
Well, I almost expected it to work. Next I tried adding some ..'s here and there, as well as ..count.. and ..sum..; the first gave another wrong result and the second threw an error. I also tried using a multiplier (1/20) instead of a divisor, with no luck.
#salvation with get()
divisor = nrow(t_df)
ggplot(t_df, aes(means, density)) +
  geom_bar(stat = "identity") +
  geom_density(data=t, aes(r, y = ..density../get("divisor", pos = 1)), stat = 'density')


So, I finally got the right figure (I think; I hope).

Please tell me there is an easier way of doing this.

PS. The get() trick apparently does not work within a function. I would have put a working function here for future use, but that wasn't so easy either.
Solution
First, read Wickham on densities in R, noting the foibles and features of each package/function.

The densities sum to 1, but that doesn't mean the curve line/points will not go above 1.
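A small sketch of that point, using a uniform distribution on [0, 0.5] as an illustrative example (it is not taken from the question's data): the density is 2 everywhere on its support, yet the area under it is still 1.

```r
# Uniform density on [0, 0.5]: height 2 on the support, total area 1
f <- function(x) dunif(x, min = 0, max = 0.5)

peak <- f(0.25)                                      # height of the curve
area <- integrate(f, lower = 0, upper = 0.5)$value   # area under the curve

peak
area
```

The narrower the range of the data, the taller the density has to be for the area to remain 1.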

The following shows both this and the inaccuracy of (at least) the defaults of density when compared to, say, KernSmooth::bkde (using base plots for brevity of typing):
library(KernSmooth)
library(flux)
library(sfsmisc)

# uniform dist
set.seed(1)
dat <- runif(100)

d1 <- density(dat)
d1_ks <- bkde(dat)

par(mfrow=c(2,1))
plot(d1)
plot(d1_ks, type="l")
auc(d1$x, d1$y)
## [1] 1.000921

integrate.xy(d1$x, d1$y)
## [1] 1.000921

auc(d1_ks$x, d1_ks$y)
## [1] 1

integrate.xy(d1_ks$x, d1_ks$y)
## [1] 1
Do the same for the beta distribution:
# beta dist
set.seed(1)
dat <- rbeta(100, 0.5, 0.1)

d2 <- density(dat)
d2_ks <- bkde(dat)

par(mfrow=c(2,1))
plot(d2)
plot(d2_ks, type="l")
auc(d2$x, d2$y)
## [1] 1.000187

integrate.xy(d2$x, d2$y)
## [1] 1.000188

auc(d2_ks$x, d2_ks$y)
## [1] 1

integrate.xy(d2_ks$x, d2_ks$y)
## [1] 1
auc and integrate.xy both use the trapezoid rule but I ran them to both show that and to show the results from two different functions. 
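If you would rather not install flux or sfsmisc just for this check, the same kind of estimate can be sketched in base R. The trapezoid() helper below is a hypothetical name written for illustration; it is not part of either package.

```r
# Trapezoid rule: sum of (interval width * average of the two end heights)
trapezoid <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

set.seed(1)
dat <- runif(100)
d <- density(dat)             # default kernel density estimate

area <- trapezoid(d$x, d$y)   # numerical area under the estimated curve
area                          # should be very close to 1
```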

The point is that the densities do in fact sum to 1, despite the y-axis values leading you to believe that they do not. I'm not sure what you are trying to solve with your manipulations.
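If what you actually want is bars whose heights are proportions (summing to 1) with a curve on the same scale, one possible approach is to multiply the density by the bin width, since density × bin width approximates the share of observations in a bin. That would also explain why dividing by 20 looked right in the question: the data span roughly [0, 1], so 20 bins are each about 1/20 wide. The sketch below assumes ggplot2 3.3.0 or later, where after_stat() replaces the ..density.. notation; plot_prop_hist is a hypothetical helper name, not a ggplot2 function.

```r
library(ggplot2)

# Hypothetical helper: histogram of proportions plus a density curve
# rescaled to the same y-axis
plot_prop_hist <- function(x, bins = 20) {
  df <- data.frame(x = x)
  bw <- diff(range(x)) / bins   # width of each of `bins` equal-width bins
  ggplot(df, aes(x)) +
    # bar height = share of observations in the bin (heights sum to 1)
    geom_histogram(aes(y = after_stat(count / sum(count))),
                   binwidth = bw, boundary = min(x)) +
    # density * bin width = approximate share of observations per bin
    geom_density(aes(y = after_stat(density * bw))) +
    ylab("proportion")
}

set.seed(1)
p <- plot_prop_hist(runif(100))
p
```

Because aes() captures the environment in which it is called, the local variable bw is found without get(), so this version also works inside a function.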
