geom_histogram: wrong bins?
Problem description
I am using ggplot2 2.1.0 to plot histograms, and I see unexpected behaviour concerning the histogram bins. Here is an example with left-closed bins (i.e. [0, 0.1[) and a binwidth of 0.1.
library(ggplot2)
mydf <- data.frame(myvar = c(-1, -0.5, -0.4, -0.1, -0.1, 0.05, 0.1, 0.1, 0.25, 0.5, 1))
myplot <- ggplot(mydf, aes(myvar)) +
  geom_histogram(aes(y = ..count..), binwidth = 0.1, boundary = 0.1, closed = "left")
myplot
# inspect the computed bins and counts
ggplot_build(myplot)$data[[1]]
In this example, one would expect the value -0.4 to fall in the bin [-0.4, -0.3[, but it falls instead (mysteriously) in the bin [-0.5, -0.4[. The same happens with the value -0.1, which falls in [-0.2, -0.1[ instead of [-0.1, 0[, and so on.
Is there something here I do not fully understand (especially with the new "center" and "boundary" params)? Or is ggplot2 doing weird things there?
Thanks in advance, Best regards, Arnaud
PS: Also asked here: https://github.com/hadley/ggplot2/issues/1651
Solution
Edit: The problem described below was fixed in a recent release of ggplot2.
Your issue is reproducible and appears to be caused by rounding errors, as suggested in the comments by Roland. At this point, this looks to me like a bug introduced in version ggplot2_2.0.0. I speculate below about its origin, but first let me present a workaround based on the boundary option.
PROBLEM:
df <- data.frame(var = seq(-100, 100, 10) / 100)
as.list(df) # check the data
## $var
##  [1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2
## [10] -0.1  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7
## [19]  0.8  0.9  1.0
library("ggplot2")
p <- ggplot(data = df, aes(x = var)) +
  geom_histogram(aes(y = ..count..),
                 binwidth = 0.1,
                 boundary = 0.1,
                 closed = "left")
p
SOLUTION
Tweak the boundary parameter. In this example, setting it just below 1, say 0.99, works. Your use case should be amenable to tweaking too.
ggplot(data = df, aes(x = var)) +
  geom_histogram(aes(y = ..count..),
                 binwidth = 0.05,
                 boundary = 0.99,
                 closed = "left")
(I have made the binwidth narrower for a better visual.)
Another workaround is to introduce your own fuzziness, e.g. multiply the data by 1 plus a small multiple of the machine epsilon (see eps below). In ggplot2 the fuzziness multiplies by 1e-7 (earlier versions) or 1e-8 (later versions).
CAUSE:
The problem appears clearly in ncount:
str(ggplot_build(p)$data[[1]])
## 'data.frame': 20 obs. of 17 variables:
## $ y       : num 1 1 1 1 1 2 1 1 1 0 ...
## $ count   : num 1 1 1 1 1 2 1 1 1 0 ...
## $ x       : num -0.95 -0.85 -0.75 -0.65 -0.55 -0.45 -0.35 -0.25 -0.15 -0.05 ...
## $ xmin    : num -1 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 ...
## $ xmax    : num -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0 ...
## $ density : num 0.476 0.476 0.476 0.476 0.476 ...
## $ ncount  : num 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0 ...
## $ ndensity: num 1.05 1.05 1.05 1.05 1.05 2.1 1.05 1.05 1.05 0 ...
## $ PANEL   : int 1 1 1 1 1 1 1 1 1 1 ...
## $ group   : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ ymin    : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ymax    : num 1 1 1 1 1 2 1 1 1 0 ...
## $ colour  : logi NA NA NA NA NA NA ...
## $ fill    : chr "grey35" "grey35" "grey35" "grey35" ...
## $ size    : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ linetype: num 1 1 1 1 1 1 1 1 1 1 ...
## $ alpha   : logi NA NA NA NA NA NA ...
ggplot_build(p)$data[[1]]$ncount
## [1] 0.5 0.5 0.5 0.5 0.5 1.0 0.5 0.5 0.5 0.0 1.0 0.5
## [13] 0.5 0.5 0.0 1.0 0.5 0.0 1.0 0.5
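The uneven counts (a doubled bin next to an empty one) are the pattern one would expect when a value that sits exactly on a bin edge is compared against a break carrying a tiny rounding error. The following is a minimal base-R sketch of that effect; it is not ggplot2's internal code, and the names breaks, x, bin_index and bin_label are chosen only for illustration.

```r
# Breaks generated by arithmetic, the way a binning routine might build them.
# (Illustrative only; this is not ggplot2's internal code.)
breaks <- seq(0, 1, by = 0.1)
x <- 0.3                       # a value that "should" sit exactly on a break

# The fourth break is computed as 3 * 0.1, which is slightly larger than
# the literal 0.3 in double-precision arithmetic.
on_break   <- breaks[4] == x   # FALSE
break_high <- breaks[4] > x    # TRUE

# With left-closed bins [a, b), x is just below "its" break, so it lands
# one bin too low: in [0.2, 0.3) instead of [0.3, 0.4).
bin_index <- findInterval(x, breaks)
bin_label <- as.character(cut(x, breaks, right = FALSE, include.lowest = TRUE))
```

Here bin_index is 3 rather than 4, which mirrors how -0.4 ends up in [-0.5, -0.4[ in the question.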
ROUNDING ERRORS?
Looks like:
df <- data.frame(var = as.integer(seq(-100, 100, 10)))
# eps <- 1.000000000000001 # on my system
eps <- 1 + 10 * .Machine$double.eps
p <- ggplot(data = df, aes(x = eps * var / 100)) +
  geom_histogram(aes(y = ..count..),
                 binwidth = 0.05,
                 closed = "left")
p
(I have removed the boundary option altogether.)
This behaviour appears some time after ggplot2_1.0.1. Looking at the source code, e.g. bin.R and stat-bin.r in https://github.com/hadley/ggplot2/blob/master/R, and tracing the computations of count leads to the function bin_vector(), which contains the following lines:
bin_vector <- function(x, bins, weight = NULL, pad = FALSE) {
  # ... STUFF HERE I HAVE DELETED FOR CLARITY ...
  cut(x, bins$breaks, right = bins$right_closed,
      include.lowest = TRUE)
  # ... STUFF HERE I HAVE DELETED FOR CLARITY ...
}
By comparing the current versions of these functions with older ones, you should be able to find the reason for the different behaviour... to be continued...
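To see why it matters whether cut() receives the exact or the fuzzy breaks, here is a hedged base-R sketch. The shifting scheme and the tolerance (1e-8 times the binwidth) are assumptions made for illustration and are not copied from the ggplot2 source.

```r
# A sketch of "fuzzy" breaks for left-closed bins (assumed scheme, not
# ggplot2's actual implementation): shift every break down by a tiny
# tolerance so that values sitting on a break survive rounding error.
binwidth <- 0.1
breaks   <- seq(0, 1, by = binwidth)
fuzz     <- 1e-8 * binwidth        # hypothetical tolerance
fuzzy    <- breaks - fuzz

x <- c(0.3, 0.6, 0.7)              # values meant to open a new bin

exact_bins <- findInterval(x, breaks)  # each value falls one bin too low
fuzzy_bins <- findInterval(x, fuzzy)   # each value falls in the intended bin
```

The trade-off is that any value lying within the tolerance below a break is also pushed into the next bin, which is why the tolerance has to stay far smaller than the binwidth.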
SUMMING UP DEBUGGING
By "patching" the bin_vector function and printing the output to screen, it appears that:
bins$fuzzy correctly stores the fuzzy parameters.
The non-fuzzy bins$breaks are used in the computations, but as far as I can see (and correct me if I'm wrong) the bins$fuzzy are not.
If I simply replace bins$breaks with bins$fuzzy at the top of bin_vector, the correct plot is returned. Not a proof of a bug, but a suggestion that perhaps more could be done to emulate the behaviour of previous versions of ggplot2.
At the top of bin_vector I expected to find a condition upon which to return either bins$breaks or bins$fuzzy. I think that's missing now.
PATCHING
To "patch" the bin_vector function, copy the function definition from the github source or, more conveniently, from the terminal, with:
ggplot2:::bin_vector
Modify it (patch it) and assign it into the namespace:
library("ggplot2")
bin_vector <- function(x, bins, weight = NULL, pad = FALSE) {
  # ... STUFF HERE I HAVE DELETED FOR CLARITY ...
  ## MY PATCH: Replace bins$breaks with bins$fuzzy
  bin_idx <- cut(x, bins$fuzzy, right = bins$right_closed,
                 include.lowest = TRUE)
  # ... STUFF HERE I HAVE DELETED FOR CLARITY ...
  ggplot2:::bin_out(bin_count, bin_x, bin_widths)
  ## THIS IS THE PATCHED FUNCTION
}
assignInNamespace("bin_vector", bin_vector, ns = "ggplot2")
df <- data.frame(var = seq(-100, 100, 10) / 100)
ggplot(data = df, aes(x = var)) +
  geom_histogram(aes(y = ..count..),
                 binwidth = 0.05,
                 boundary = 1,
                 closed = "left")
Just to be clear, the code above is edited for clarity: the real function contains a lot of type-checking and other calculations which I have removed here, but which you need to keep when you patch the function. Before you run the patch, restart your R session or detach your currently loaded ggplot2.
OLD VERSIONS
The unexpected behaviour is NOT observed in versions ggplot2_0.9.3 or ggplot2_1.0.1 and appears to originate in the current release ggplot2_2.0.1 (or perhaps the earlier ggplot2_2.0.0, which gave me an error when I tried to call it).
To install and load an old version, say ggplot2_0.9.3, create a separate directory (no point in overwriting the current version), say ggplot2093:
URL <- "http://cran.r-project.org/src/contrib/Archive/ggplot2/ggplot2_0.9.3.tar.gz"
# the target library directory must exist before installing into it
dir.create("~/R/testing/ggplot2093", recursive = TRUE, showWarnings = FALSE)
install.packages(URL, repos = NULL, type = "source",
                 lib = "~/R/testing/ggplot2093")
To load the old version, call it from your local directory:
library("ggplot2", lib.loc = "~/R/testing/ggplot2093")
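When several copies of a package live side by side, it can help to confirm which version a given library directory actually holds before comparing plots. The helper below is a hypothetical convenience wrapper (the name installed_version is not part of any package) around utils::packageVersion(); it is demonstrated on the base package stats only so that the snippet runs on any R installation.

```r
# Hypothetical helper: report the version of a package as found in a given
# library directory (NULL means the default library paths).
installed_version <- function(pkg, lib = NULL) {
  as.character(utils::packageVersion(pkg, lib.loc = lib))
}

# Base package, used here only so the example runs anywhere.
stats_version <- installed_version("stats")

# For the side-by-side install described above, one would call:
# installed_version("ggplot2", lib = "~/R/testing/ggplot2093")
```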