In R ggplot2, include stat_ecdf() endpoints (0,0) and (1,1)
Problem description
I'm trying to use stat_ecdf() to plot cumulative successes as a function of a rank score created by a predictive model.
#libraries
require(ggplot2)
require(scales)
# fake data for reproducibility
set.seed(123)
n <- 200
df <- data.frame(model_score= rexp(n=n,rate=1:n),
obs_set= sample(c("training","validation"),n,replace=TRUE))
df$model_rank <- rank(df$model_score)/n
df$target_outcome <- rbinom(n,1,1-df$model_rank)
# Plot Gain Chart using stat_ecdf()
ggplot(subset(df,target_outcome==1),aes(x = model_rank)) +
stat_ecdf(aes(colour = obs_set), size=1) +
scale_x_continuous(limits=c(0,1), labels=percent,breaks=seq(0,1,.1)) +
xlab("Model Percentile") + ylab("Percent of Target Outcome") +
scale_y_continuous(limits=c(0,1), labels=percent) +
geom_segment(aes(x=0,y=0,xend=1,yend=1),
colour = "gray", linetype="longdash", size=1) +
ggtitle("Gain Chart")
All I want to do is force the ECDF to start at (0,0) and end at (1,1) so that there are no gaps at the beginning or end of the curve. If possible, I'd like to do it within the syntax of ggplot2, but I'd settle for a clever workaround.
@Henrik this is NOT a duplicate of this question, because I have already defined my limits with scale_x_ and _y_continuous(), and adding expand_limits() doesn't do anything. It is not the origin of the PLOT but the endpoints of the stat_ecdf() that need to be fixed.
Unfortunately, the definition of stat_ecdf gives no wiggle room here; it determines the endpoints internally.
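To see where the gaps come from, it may help to look at what an ECDF actually contains. The sketch below uses base R's ecdf() on three made-up values (an assumption for illustration, not the question's data). The jumps sit at the observed values only, so a curve drawn just over those values starts at the smallest observation with height 1/n and stops at the largest one, rather than running from (0,0) to (1,1).

```r
# Base R only: ecdf() and knots() come from the stats package.
x <- c(0.2, 0.5, 0.9)   # hypothetical ranks, all strictly inside (0, 1)
Fn <- ecdf(x)

jump_x <- knots(Fn)     # jump locations: the observed values themselves
jump_y <- Fn(jump_x)    # height just after each jump: 1/3, 2/3, 1

print(data.frame(x = jump_x, y = jump_y))

# The function itself is defined everywhere, it is simply not drawn there:
Fn(0)   # 0, below the smallest observation
Fn(1)   # 1, above the largest observation
```

Anything that evaluates the ECDF only at the observed values will therefore leave a gap on each side unless extra points are added.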
There is a somewhat advanced solution. With the latest version of ggplot2 (devtools::install_github("hadley/ggplot2")), the extensibility is improved, to the point where it is possible to override this behavior, but not without some boilerplate.
stat_ecdf2 <- function(mapping = NULL, data = NULL, geom = "step",
position = "identity", n = NULL, show.legend = NA,
inherit.aes = TRUE, minval=NULL, maxval=NULL,...) {
layer(
data = data,
mapping = mapping,
stat = StatEcdf2,
geom = geom,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
stat_params = list(n = n, minval=minval,maxval=maxval),
params = list(...)
)
}
StatEcdf2 <- ggproto("StatEcdf2", StatEcdf,
calculate = function(data, scales, n = NULL, minval=NULL, maxval=NULL, ...) {
df <- StatEcdf$calculate(data, scales, n, ...)
if (!is.null(minval)) { df$x[1] <- minval }
if (!is.null(maxval)) { df$x[length(df$x)] <- maxval }
df
}
)
Now, stat_ecdf2 will behave the same as stat_ecdf, but with optional minval and maxval parameters. So this will do the trick:
ggplot(subset(df,target_outcome==1),aes(x = model_rank)) +
stat_ecdf2(aes(colour = obs_set), size=1, minval=0, maxval=1) +
scale_x_continuous(limits=c(0,1), labels=percent,breaks=seq(0,1,.1)) +
xlab("Model Percentile") + ylab("Percent of Target Outcome") +
scale_y_continuous(limits=c(0,1), labels=percent) +
geom_segment(aes(x=0,y=0,xend=1,yend=1),
colour = "gray", linetype="longdash", size=1) +
ggtitle("Gain Chart")
The big caveat here is that I don't know if the current extensibility model will be supported in the future; it has changed several times in the past, and the change to use "ggproto" is recent -- like July 15th 2015 recent.
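That caveat turned out to matter. In released versions of ggplot2, layer() takes a single params list (there is no stat_params argument) and the per-group method is called compute_group rather than calculate, so the code above is unlikely to run unchanged. Below is a minimal sketch of the same idea for a recent release. It assumes ggplot2 3.3.0 or later, where StatEcdf$compute_group accepts n, pad and flipped_aes, and it assumes the ECDF is drawn along x. The names StatEcdfClamp and stat_ecdf_clamp are hypothetical, not part of ggplot2. Instead of overwriting the first and last observation, it relies on the documented pad = TRUE behaviour, which adds the points (-Inf, 0) and (Inf, 1), and moves those two padded points to minval and maxval.

```r
library(ggplot2)

# Hypothetical stat: same as StatEcdf, but the padded -Inf/Inf points are
# moved to minval/maxval. Assumes ggplot2 >= 3.3.0 and an x-oriented ECDF.
StatEcdfClamp <- ggproto("StatEcdfClamp", StatEcdf,
  compute_group = function(data, scales, n = NULL, pad = TRUE,
                           flipped_aes = FALSE, minval = NULL, maxval = NULL) {
    df <- StatEcdf$compute_group(data, scales, n = n, pad = pad,
                                 flipped_aes = flipped_aes)
    # With pad = TRUE the parent returns x = -Inf (height 0) and x = Inf (height 1).
    if (!is.null(minval)) df$x[is.infinite(df$x) & df$x < 0] <- minval
    if (!is.null(maxval)) df$x[is.infinite(df$x) & df$x > 0] <- maxval
    df
  }
)

# Hypothetical constructor, modelled on the usual stat_*() boilerplate.
stat_ecdf_clamp <- function(mapping = NULL, data = NULL, geom = "step",
                            position = "identity", ..., n = NULL,
                            minval = NULL, maxval = NULL, na.rm = FALSE,
                            show.legend = NA, inherit.aes = TRUE) {
  layer(
    data = data, mapping = mapping, stat = StatEcdfClamp, geom = geom,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(n = n, minval = minval, maxval = maxval, na.rm = na.rm, ...)
  )
}

# Small made-up data set, strictly inside (0, 1).
set.seed(1)
demo <- data.frame(model_rank = runif(50, min = 0.1, max = 0.9))

p <- ggplot(demo, aes(x = model_rank)) +
  stat_ecdf_clamp(minval = 0, maxval = 1)
built <- layer_data(p)   # computed layer data, useful for checking endpoints
range(built$x)
```

Because only the padded points are moved, the observed values keep their original positions. Plain stat_ecdf() with its default pad = TRUE may already be enough when a curve that runs to the panel edges is acceptable.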
As a plus, this gave me a chance to really dig into ggplot's internals, which is something that I've been meaning to do for a while.
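For readers who would rather not depend on ggplot2 internals at all, the "clever workaround" route is to compute the ECDF points outside ggplot2, add the two endpoints by hand, and draw the result with geom_step(). This is only a sketch under stated assumptions: ecdf_points is a hypothetical helper, the data are made up, and lo and hi are assumed to bracket the observed values.

```r
library(ggplot2)

# Hypothetical helper (not part of ggplot2): ECDF points for one group,
# with explicit endpoints (lo, 0) and (hi, 1) added.
ecdf_points <- function(x, lo = 0, hi = 1) {
  xs <- sort(unique(x))
  data.frame(x = c(lo, xs, hi), y = c(0, ecdf(x)(xs), 1))
}

# Made-up data with the same column names as in the question.
set.seed(123)
demo <- data.frame(
  model_rank = runif(100),
  obs_set = sample(c("training", "validation"), 100, replace = TRUE)
)

# Build the padded ECDF separately for each obs_set, then stack the pieces.
pts <- do.call(rbind, lapply(split(demo, demo$obs_set), function(g) {
  data.frame(obs_set = g$obs_set[1], ecdf_points(g$model_rank))
}))

ggplot(pts, aes(x = x, y = y, colour = obs_set)) +
  geom_step()
```

The trade-off is that the grouping has to be handled manually, but the plot no longer depends on how stat_ecdf() picks its endpoints.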