geom_histogram: wrong bins?


Problem description


I am using ggplot2 2.1.0 to plot histograms, and I see unexpected behaviour in the histogram bins. Here is an example with left-closed bins (i.e. [0, 0.1[) and a binwidth of 0.1.

library("ggplot2")
mydf <- data.frame(myvar = c(-1, -0.5, -0.4, -0.1, -0.1, 0.05, 0.1, 0.1, 0.25, 0.5, 1))
myplot <- ggplot(mydf, aes(myvar)) +
    geom_histogram(aes(y = ..count..), binwidth = 0.1, boundary = 0.1, closed = "left")
myplot
ggplot_build(myplot)$data[[1]]

In this example, one would expect the value -0.4 to fall in the bin [-0.4, -0.3[, but it (mysteriously) falls in the bin [-0.5, -0.4[ instead. The same happens for the value -0.1, which falls in [-0.2, -0.1[ instead of [-0.1, 0[, and so on.

Is there something here I do not fully understand (especially with the new "center" and "boundary" params)? Or is ggplot2 doing weird things there?

Thanks in advance, Best regards, Arnaud

PS: Also asked here: https://github.com/hadley/ggplot2/issues/1651

Answer

Edit: The problem described below was fixed in a recent release of ggplot2.

Your issue is reproducible and appears to be caused by rounding errors, as suggested in the comments by Roland. At this point, this looks to me like a bug introduced in version ggplot2_2.0.0. I speculate below about its origin, but first let me present a workaround based on the boundary option.

PROBLEM:

df <- data.frame(var = seq(-100,100,10)/100)
as.list(df) # check the data
## $var
##  [1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2
## [10] -0.1  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7
## [19]  0.8  0.9  1.0
library("ggplot2")
p <- ggplot(data = df, aes(x = var)) + 
    geom_histogram(aes(y = ..count..), 
        binwidth = 0.1, 
        boundary = 0.1, 
        closed = "left")
p

SOLUTION

Tweak the boundary parameter. In this example, setting it just below 1, say 0.99, works. Your use case should be amenable to similar tweaking.

ggplot(data = df, aes(x = var)) + 
    geom_histogram(aes(y = ..count..), 
        binwidth = 0.05, 
        boundary = 0.99, 
        closed = "left")

(I have made the binwidth narrower to make the plot easier to read)
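A plausible reading of why this workaround helps (an interpretation, not taken from the ggplot2 source): with boundary = 0.99 and binwidth = 0.05, no data value sits on a bin edge, so tiny rounding differences at the edges cannot push a value into a neighbouring bin. The base R sketch below assumes the bin edges lie at boundary + k * binwidth; the variable names are illustrative only.

```r
# Data from the example above: multiples of 0.1 between -1 and 1
x <- seq(-100, 100, 10) / 100

# Assumed edge positions: boundary + k * binwidth, covering the data range
binwidth <- 0.05
boundary <- 0.99
edges <- boundary + (-40:1) * binwidth

# Smallest gap between any data value and any edge
min_gap <- min(abs(outer(x, edges, "-")))
print(min_gap)   # about 0.01, far larger than any rounding error
```

With boundary = 0.1 and binwidth = 0.1, by contrast, every data value coincides with an edge, which is exactly where rounding matters.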

Another workaround is to introduce your own fuzziness, e.g. multiply the data by 1 plus a tiny amount close to the machine epsilon (see eps below). In ggplot2 the fuzziness uses a factor of 1e-7 (earlier versions) or 1e-8 (later versions).

CAUSE:

The problem appears clearly in ncount:

str(ggplot_build(p)$data[[1]])
##  'data.frame':   20 obs. of  17 variables:
##   $ y       : num  1 1 1 1 1 2 1 1 1 0 ...
##   $ count   : num  1 1 1 1 1 2 1 1 1 0 ...
##   $ x       : num  -0.95 -0.85 -0.75 -0.65 -0.55 -0.45 -0.35 -0.25 -0.15 -0.05 ...
##   $ xmin    : num  -1 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 ...
##   $ xmax    : num  -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0 ...
##   $ density : num  0.476 0.476 0.476 0.476 0.476 ...
##   $ ncount  : num  0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0 ...
##   $ ndensity: num  1.05 1.05 1.05 1.05 1.05 2.1 1.05 1.05 1.05 0 ...
##   $ PANEL   : int  1 1 1 1 1 1 1 1 1 1 ...
##   $ group   : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##   $ ymin    : num  0 0 0 0 0 0 0 0 0 0 ...
##   $ ymax    : num  1 1 1 1 1 2 1 1 1 0 ...
##   $ colour  : logi  NA NA NA NA NA NA ...
##   $ fill    : chr  "grey35" "grey35" "grey35" "grey35" ...
##   $ size    : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##   $ linetype: num  1 1 1 1 1 1 1 1 1 1 ...
##   $ alpha   : logi  NA NA NA NA NA NA ...

ggplot_build(p)$data[[1]]$ncount
##  [1] 0.5 0.5 0.5 0.5 0.5 1.0 0.5 0.5 0.5 0.0 1.0 0.5
## [13] 0.5 0.5 0.0 1.0 0.5 0.0 1.0 0.5
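For reference, ncount is the bin count rescaled so that the fullest bin equals 1. The small base R sketch below uses the first ten counts from the str() output above. For evenly spaced data one would expect roughly one value per bin, so a bin with two values or an empty bin hints at a value that landed on the wrong side of an edge.

```r
# First ten bin counts, copied from the str() output above
count <- c(1, 1, 1, 1, 1, 2, 1, 1, 1, 0)

# Rescale so that the fullest bin is 1
ncount <- count / max(count)
print(ncount)

# Bins that deviate from the expected one value per bin
suspicious <- which(count != 1)
print(suspicious)   # positions 6 and 10
```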

ROUNDING ERRORS?

It looks that way:

df <- data.frame(var = as.integer(seq(-100,100,10)))
# eps <- 1.000000000000001 # on my system
eps <- 1+10*.Machine$double.eps
p <- ggplot(data = df, aes(x = eps*var/100)) + 
    geom_histogram(aes(y = ..count..), 
                   binwidth = 0.05, 
                   closed = "left")
p

(I have removed the boundary option altogether)
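The rounding effect can be reproduced in base R without ggplot2. The sketch below assumes the breaks are built arithmetically with seq(), which may differ from what ggplot2 does internally. It only illustrates how a computed edge that prints as -0.4 can differ from the literal -0.4.

```r
# Breaks built arithmetically (assumption: stands in for a binning routine)
breaks <- seq(-1, 1, by = 0.1)
x <- -0.4

# The seventh break prints as -0.4 but is not the same double as the literal
print(breaks[7] == x)
print(sprintf("%.17f", c(breaks[7], x)))

# Left-closed binning: x lands in bin 6, [-0.5, -0.4), not bin 7
bin_index <- as.integer(cut(x, breaks, right = FALSE, include.lowest = TRUE))
print(bin_index)
```

The computed edge is slightly larger than -0.4, so the value -0.4 sits just below it and is counted in the bin to the left.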

This behaviour appears some time after ggplot2_1.0.1. Looking at the source code, e.g. bin.R and stat-bin.r in https://github.com/hadley/ggplot2/blob/master/R, and tracing the computations of count leads to function bin_vector(), which contains the following lines:

bin_vector <- function(x, bins, weight = NULL, pad = FALSE) {
 ... STUFF HERE I HAVE DELETED FOR CLARITY ...
cut(x, bins$breaks, right = bins$right_closed,
include.lowest = TRUE)
... STUFF HERE I HAVE DELETED FOR CLARITY ...
}

By comparing the current versions of these functions with older ones, you should be able to find the reason for the different behaviour... to be continued...

SUMMING UP DEBUGGING

By "patching" the bin_vector function and printing the output to screen, it appears that:

  1. bins$fuzzy correctly stores the fuzzy parameters

  2. The non-fuzzy bins$breaks are used in the computations, but as far as I can see (and correct me if I'm wrong) the bins$fuzzy are not.

  3. If I simply replace bins$breaks with bins$fuzzy at the top of bin_vector, the correct plot is returned. Not a proof of a bug, but a suggestion that perhaps more could be done to emulate the behaviour of previous versions of ggplot2.

  4. At the top of bin_vector I expected to find a condition upon which to return either bins$breaks or bins$fuzzy. I think that's missing now.
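To illustrate what cutting on fuzzy breaks achieves, here is a base R sketch. It does not reproduce ggplot2's code: the fuzz value of 1e-8 and the names bin_of and fuzzy_breaks are hypothetical, chosen only to mirror the description above.

```r
breaks <- seq(-1, 1, by = 0.1)
x <- c(-0.4, -0.1)

# Hypothetical helper: index of the left-closed bin containing each value
bin_of <- function(x, b) {
  as.integer(cut(x, b, right = FALSE, include.lowest = TRUE))
}

# Exact breaks: rounding puts both values one bin too low
print(bin_of(x, breaks))

# Fuzzy breaks: shift the edges down by a tiny amount (assumed 1e-8)
# so that values sitting on an edge fall on the intended side
fuzz <- 1e-8
fuzzy_breaks <- breaks - fuzz
print(bin_of(x, fuzzy_breaks))
```

With the exact breaks the values land in bins 6 and 9. With the shifted breaks they land in bins 7 and 10, i.e. [-0.4, -0.3) and [-0.1, 0), which is what the question expected.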

PATCHING

To "patch" the bin_vector function, copy the function definition from the GitHub source or, more conveniently, print it in the R console with:

 ggplot2:::bin_vector

Modify it (patch it) and assign it into the namespace:

library("ggplot2")
bin_vector <- function (x, bins, weight = NULL, pad = FALSE) 
{
... STUFF HERE I HAVE DELETED FOR CLARITY ...
## MY PATCH: Replace bins$breaks with bins$fuzzy
bin_idx <- cut(x, bins$fuzzy, right = bins$right_closed,
include.lowest = TRUE)
... STUFF HERE I HAVE DELETED FOR CLARITY ...
ggplot2:::bin_out(bin_count, bin_x, bin_widths)
## THIS IS THE PATCHED FUNCTION
}
assignInNamespace("bin_vector", bin_vector, ns = "ggplot2")
df <- data.frame(var = seq(-100,100,10)/100)
ggplot(data = df, aes(x = var)) + geom_histogram(aes(y = ..count..), binwidth = 0.05, boundary = 1, closed = "left")

Just to be clear, the code above is edited for clarity: the function has a lot of type-checking and other calculations which I have removed, but which you would need to patch the function. Before you run the patch, restart your R session or detach your currently loaded ggplot2.

OLD VERSIONS

The unexpected behaviour is NOT observed in versions ggplot2_0.9.3 or ggplot2_1.0.1 and appears to originate in the current release ggplot2_2.0.1 (or perhaps the earlier ggplot2_2.0.0, which gave me an error when I tried to call it).

To install and load an old version, say ggplot2_0.9.3, create a separate directory (no point in overwriting the current version), say ggplot2093:

URL <- "http://cran.r-project.org/src/contrib/Archive/ggplot2/ggplot2_0.9.3.tar.gz" 
install.packages(URL, repos = NULL, type = "source", 
    lib = "~/R/testing/ggplot2093") 

To load the old version, call it from your local directory:

library("ggplot2", lib.loc = "~/R/testing/ggplot2093") 
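install.packages() generally expects the target library directory to exist already, so it can help to create it first and, after installing, to check which version ended up there. A minimal sketch that reuses the directory from above; checking via installed.packages() is just one way to do this.

```r
lib_dir <- path.expand("~/R/testing/ggplot2093")

# Create the separate library directory if it does not exist yet
dir.create(lib_dir, recursive = TRUE, showWarnings = FALSE)

# After installing, report the version found in that directory (if any)
pkgs <- installed.packages(lib.loc = lib_dir)
if ("ggplot2" %in% rownames(pkgs)) {
  print(packageVersion("ggplot2", lib.loc = lib_dir))
} else {
  message("ggplot2 is not installed in ", lib_dir)
}
```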
