Cost function in cv.glm of the boot library in R


Question

I am trying to use the cross-validation function cv.glm from the boot library in R to determine the number of misclassifications when a glm logistic regression is applied.

The function has the following signature:

cv.glm(data, glmfit, cost, K)

The first two arguments denote the data and the model, and K specifies the number of folds. My problem is the cost parameter, which is defined as:


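cost: A function of two vector arguments specifying the cost function for the cross-validation. The first argument to cost should correspond to the observed responses and the second argument should correspond to the predicted or fitted responses from the generalized linear model. cost must return a non-negative scalar value. The default is the average squared error function.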

I guess for classification it would make sense to have a function that returns the misclassification rate, something like:

nrow(subset(data, (predict >= 0.5  & data$response == "no") | 
                  (predict <  0.5  & data$response == "yes")))

which is of course not even syntactically correct.

Unfortunately, my limited R knowledge has cost me hours, and I was wondering if someone could point me in the right direction.

Recommended answer

It sounds like you might do well to just use the cost function (i.e. the one named cost) defined further down in the "Examples" section of ?cv.glm. Quoting from that section:

 # [...] Since the response is a binary variable an
 # appropriate cost function is
 cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)

This does essentially what you were trying to do in your example. Replacing your "no" and "yes" with 0 and 1, let's say you have two vectors, predict and response. Then cost() is nicely designed to take them and return the misclassification rate:

## Simulate some reasonable data
set.seed(1)
predict <- seq(0.1, 0.9, by=0.1)
response <-  rbinom(n=length(predict), prob=predict, size=1)
response
# [1] 0 0 0 1 0 0 0 1 1

## Demonstrate the function 'cost()' in action
cost(response, predict)
# [1] 0.3333333  ## Which is right, as 3/9 elements (4, 6, & 7) are misclassified
                 ## (assuming you use 0.5 as the cutoff for your predictions).
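The recoding of "no"/"yes" to 0/1 mentioned above can be sketched as follows. The vectors response_chr and predicted are made-up illustration data, not taken from the question, and the comparison against "yes" assumes "yes" is the class the model predicts the probability of:

```r
# Hypothetical labels and predicted probabilities, for illustration only
response_chr <- c("no", "no", "yes", "yes", "no")
predicted    <- c(0.2, 0.7, 0.9, 0.4, 0.1)

# Recode the character labels as 0/1 (assumes "yes" is the modelled class)
response01 <- as.integer(response_chr == "yes")

# Cost function quoted from the Examples section of ?cv.glm
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

cost(response01, predicted)
# [1] 0.4   ## elements 2 and 4 are on the wrong side of the 0.5 cutoff

# The count, rather than the rate, of misclassifications
sum(abs(response01 - predicted) > 0.5)
# [1] 2
```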

I'm guessing the trickiest bit will be getting your mind fully wrapped around the idea of passing a function in as an argument. (At least for me, that was for the longest time the hardest part of using the boot package, which requires that move in a fair number of places.)
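To make the "function as an argument" idea concrete, here is a minimal sketch of passing cost to cv.glm. The data frame dat, its columns x and response, and the simulation settings are all assumptions made up for this illustration; only cv.glm(), glm() and the cost function come from the discussion above.

```r
library(boot)  # provides cv.glm()

# Simulated data (hypothetical; not from the question)
set.seed(42)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$response <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * dat$x))

# Logistic regression fitted on the full data set
fit <- glm(response ~ x, family = binomial, data = dat)

# Misclassification cost from the Examples section of ?cv.glm
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

# The function itself is passed in, without parentheses; cv.glm calls it
# on the held-out responses and predicted probabilities of each fold
cv_out <- cv.glm(data = dat, glmfit = fit, cost = cost, K = 10)

cv_out$delta[1]      # raw cross-validated misclassification rate
cv_out$delta[1] * n  # the same estimate expressed as a count
```

The first element of delta is the raw cross-validation estimate and the second is a bias-adjusted version, so the first one is probably the number to report if a plain misclassification rate is wanted.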

Added 2016-03-22:

The function cost() given above is, in my opinion, unnecessarily obfuscated; for a 0/1 response, the following alternative does exactly the same thing but in a more expressive way:

cost <- function(r, pi = 0) { 
        mean((pi < 0.5) & r==1 | (pi > 0.5) & r==0)
}
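A quick way to convince yourself that the two definitions agree is to run both on the same simulated data. The names cost_abs and cost_explicit, and the simulated vectors, are introduced here only for the comparison:

```r
# The two definitions under different (hypothetical) names
cost_abs      <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
cost_explicit <- function(r, pi = 0) {
  mean((pi < 0.5) & r == 1 | (pi > 0.5) & r == 0)
}

# Simulated predicted probabilities and 0/1 responses
set.seed(1)
pi_hat <- runif(1000)
r <- rbinom(1000, size = 1, prob = pi_hat)

# Should report TRUE if the two versions agree on this sample
all.equal(cost_abs(r, pi_hat), cost_explicit(r, pi_hat))
```

Both versions treat a predicted probability of exactly 0.5 as correctly classified, whatever the observed response is.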
