How to properly specify a gradient function for use in optim() or other optimizers


Problem description

I have an optimization problem that the Nelder-Mead method will solve, but that I would also like to solve using BFGS or Newton-Raphson, or something that takes a gradient function, for more speed, and hopefully more precise estimates. I wrote such a gradient function following (I thought) the example in the optim / optimx documentation, but when I use it with BFGS my starting values either don't move (optim()), or else the function outright doesn't run (optimx(), which returns Error: Gradient function might be wrong - check it!). I'm sorry there's a bit of code involved in reproducing this, but here goes:

This is the function that I want to get parameter estimates for (this is for smoothing old-age mortality rates, where x is age, starting at age 80):

    KannistoMu <- function(pars, x = .5:30.5){
      a <- pars["a"]
      b <- pars["b"]
      (a * exp(b * x)) / (1 + a * exp(b * x))
    }
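
As a quick orientation, here's an illustrative call of my own (the parameter values are arbitrary, not from the original post): the function returns the fitted hazard at the 31 age midpoints x = 0.5, ..., 30.5 above age 80.

    # Illustrative only: hazard at trial values a = 0.1, b = 0.1
    KannistoMu(c(a = 0.1, b = 0.1))  # vector of 31 rates, one per age group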

And here's a log-likelihood function for estimating it from observed rates (defined as deaths, .Dx, over exposure, .Exp):

    KannistoLik1 <- function(pars, .Dx, .Exp, .x. = .5:30.5){
      mu <- KannistoMu(exp(pars), x = .x.)
      # take negative and minimize it (default optimizer behavior)
      -sum(.Dx * log(mu) - .Exp * mu, na.rm = TRUE) 
    }

You see exp(pars) in there because I give log(pars) to the optimizer, in order to constrain the final a and b to be positive.
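
Here is that pattern in isolation, as a minimal sketch of my own (the toy objective and names are illustrative, not from the post): the optimizer works on an unconstrained scale while the model only ever sees positive values.

    # Illustrative sketch: optimize over p = log(theta); exp() inside fn
    # guarantees theta > 0 no matter what values the optimizer proposes.
    fn.pos <- function(p) {
      theta <- exp(p)      # back-transform; always positive
      (theta - 2)^2        # toy objective with minimum at theta = 2
    }
    opt <- optim(0, fn.pos, method = "BFGS")
    exp(opt$par)           # ~ 2, recovered on the constrained scale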

Example data (1962 Japan females, if anyone is curious):

    .Dx <- structure(c(10036.12, 9629.12, 8810.11, 8556.1, 7593.1, 6975.08, 
      6045.08, 4980.06, 4246.06, 3334.04, 2416.03, 1676.02, 1327.02, 
      980.02, 709, 432, 350, 217, 134, 56, 24, 21, 10, 8, 3, 1, 2, 
      1, 0, 0, 0), .Names = c("80", "81", "82", "83", "84", "85", "86", 
      "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97", 
      "98", "99", "100", "101", "102", "103", "104", "105", "106", 
      "107", "108", "109", "110"))
    .Exp <- structure(c(85476.0333333333, 74002.0866666667, 63027.5183333333, 
      53756.8983333333, 44270.9, 36749.85, 29024.9333333333, 21811.07, 
      16912.315, 11917.9583333333, 7899.33833333333, 5417.67, 3743.67833333333, 
      2722.435, 1758.95, 1043.985, 705.49, 443.818333333333, 223.828333333333, 
      93.8233333333333, 53.1566666666667, 27.3333333333333, 16.1666666666667, 
      10.5, 4.33333333333333, 3.16666666666667, 3, 2.16666666666667, 
      1.5, 0, 1), .Names = c("80", "81", "82", "83", "84", "85", "86", 
      "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97", 
      "98", "99", "100", "101", "102", "103", "104", "105", "106", 
      "107", "108", "109", "110"))

The following works with the Nelder-Mead method:

    NMab <- optim(log(c(a = .1, b = .1)), 
      fn = KannistoLik1, method = "Nelder-Mead",
      .Dx = .Dx, .Exp = .Exp)
    exp(NMab$par) 
    # these are reasonable estimates
            a         b 
    0.1243144 0.1163926 

This is the gradient function I came up with:

    Kannisto.gr <- function(pars, .Dx, .Exp, x = .5:30.5){
      a <- exp(pars["a"])
      b <- exp(pars["b"])
      d.a <- (a * exp(b * x) * .Exp + (-a * exp(b * x) - 1) * .Dx) /
        (a ^ 3 * exp(2 * b * x) + 2 * a ^ 2 * exp(b * x) + a)
      d.b <- (a * x * exp(b * x) * .Exp + (-a * x * exp(b * x) - x) * .Dx) /
        (a ^ 2 * exp(2 * b * x) + 2 * a * exp(b * x) + 1)
      -colSums(cbind(a = d.a, b = d.b), na.rm = TRUE)
    }

The output is a vector of length 2: the partial derivatives with respect to the parameters a and b. I also have an uglier version arrived at by exploiting the output of deriv(), which returns the same answer and which I don't post (just to confirm that the derivatives are right).
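
For completeness, here's a sketch of what such a deriv()-based version might look like; this is my reconstruction under the same setup, not the author's unposted code, and "Expo" is my stand-in name for the exposures.

    # Sketch only: have deriv() differentiate the per-age negative
    # log-likelihood term symbolically with respect to a and b, then sum
    # the per-age gradients.
    dK <- deriv(
      expression(-(Dx * log((a * exp(b * x)) / (1 + a * exp(b * x))) -
                   Expo * ((a * exp(b * x)) / (1 + a * exp(b * x))))),
      namevec = c("a", "b"),
      function.arg = c("a", "b", "x", "Dx", "Expo")
    )
    Kannisto.gr.deriv <- function(pars, .Dx, .Exp, x = .5:30.5){
      a <- exp(pars["a"])
      b <- exp(pars["b"])
      g <- attr(dK(a, b, x, .Dx, .Exp), "gradient")  # 31 x 2 matrix, one row per age
      colSums(g, na.rm = TRUE)  # should agree with Kannisto.gr() up to the sign convention
    }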

If I supply it to optim() as follows, with BFGS as the method, the estimates do not move from the starting values:

    BFGSab <- optim(log(c(a = .1, b = .1)), 
      fn = KannistoLik1, gr = Kannisto.gr, method = "BFGS",
      .Dx = .Dx, .Exp = .Exp)
    # estimates do not change from starting values:
    exp(BFGSab$par) 
      a   b 
    0.1 0.1

When I look at the $counts element of the output, it says that KannistoLik1() was called 31 times and Kannisto.gr() just 1 time. $convergence is 0, so I guess it thinks it converged (if I give less reasonable starts they also stay put). I reduced the tolerance, etc., and nothing changes. When I try the same call in optimx() (not shown), I receive the warning I mentioned above, and no object is returned. I get the same results when specifying gr = Kannisto.gr with the "CG" method. With the "L-BFGS-B" method I get the same starting values back as the estimates, but it is also reported that both the function and the gradient were called 21 times, and there is an error message: "ERROR: ABNORMAL_TERMINATION_IN_LNSRCH"
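
For reference, those diagnostics are read off the standard components of the optim() return value:

    BFGSab$counts       # named vector: number of calls to fn and to gr
    BFGSab$convergence  # 0 indicates that optim() believes it converged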

I'm hoping that there is some minor detail in the way the gradient function is written that will solve this, as that warning and the optimx behavior bluntly hint that the function simply isn't right (I think). I also tried the maxNR() maximizer from the maxLik package and observed similar behavior (starting values don't move). Can anyone give me a pointer? Much obliged.

@Vincent suggested I compare it with the output of a numerical approximation:

    library(numDeriv)
    grad( function(u) KannistoLik1( c(a=u[1], b=u[2]), .Dx, .Exp ), log(c(.1,.1)) )
    [1] -14477.40  -7458.34
    Kannisto.gr(log(c(a=.1,b=.1)), .Dx, .Exp)
           a        b 
    144774.0  74583.4 

So: different sign, and off by a factor of 10? I changed the gradient function to follow suit:

    Kannisto.gr2 <- function(pars, .Dx, .Exp, x = .5:30.5){
      a <- exp(pars["a"])
      b <- exp(pars["b"])
      d.a <- (a * exp(b * x) * .Exp + (-a * exp(b * x) - 1) * .Dx) /
        (a ^ 3 * exp(2 * b * x) + 2 * a ^ 2 * exp(b * x) + a)
      d.b <- (a * x * exp(b * x) * .Exp + (-a * x * exp(b * x) - x) * .Dx) /
        (a ^ 2 * exp(2 * b * x) + 2 * a * exp(b * x) + 1)
      colSums(cbind(a=d.a,b=d.b), na.rm = TRUE) / 10
    }
    Kannisto.gr2(log(c(a=.1,b=.1)), .Dx, .Exp)
    # same as numerical:
            a         b 
    -14477.40  -7458.34 

Trying it in the optimizer:

    BFGSab <- optim(log(c(a = .1, b = .1)), 
      fn = KannistoLik1, gr = Kannisto.gr2, method = "BFGS",
      .Dx = .Dx, .Exp = .Exp)
    # not reasonable results:
    exp(BFGSab$par) 
      a   b 
    Inf Inf 
    # and in fact, when not exp()'d, they look oddly familiar:
    BFGSab$par
            a         b 
    -14477.40  -7458.34 

Following Vincent's answer, I rescaled the gradient function and used abs() instead of exp() to keep the parameters positive. The most recent, and better-performing, objective and gradient functions:

    KannistoLik2 <- function(pars, .Dx, .Exp, .x. = .5:30.5){
  mu <- KannistoMu(abs(pars), x = .x.)
      # take negative and minimize it (default optimizer behavior)
      -sum(.Dx * log(mu) - .Exp * mu, na.rm = TRUE) 
    }

    # gradient, to be down-scaled in `optim()` call
    Kannisto.gr3 <- function(pars, .Dx, .Exp, x = .5:30.5){
      a <- abs(pars["a"])
      b <- abs(pars["b"])
      d.a <- (a * exp(b * x) * .Exp + (-a * exp(b * x) - 1) * .Dx) /
        (a ^ 3 * exp(2 * b * x) + 2 * a ^ 2 * exp(b * x) + a)
      d.b <- (a * x * exp(b * x) * .Exp + (-a * x * exp(b * x) - x) * .Dx) /
        (a ^ 2 * exp(2 * b * x) + 2 * a * exp(b * x) + 1)
      colSums(cbind(a = d.a, b = d.b), na.rm = TRUE) 
    }

    # try it out:
    BFGSab2 <- optim(
      c(a = .1, b = .1), 
      fn = KannistoLik2, 
      gr = function(...) Kannisto.gr3(...) * 1e-7, 
      method = "BFGS",
      .Dx = .Dx, .Exp = .Exp
    )
    # reasonable:
    BFGSab2$par
            a         b 
    0.1243249 0.1163924 

    # better:
    KannistoLik2(exp(NMab$par), .Dx = .Dx, .Exp = .Exp) > KannistoLik2(BFGSab2$par, .Dx = .Dx, .Exp = .Exp)
    [1] TRUE
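
And as one more check of my own (not in the original post), the unscaled analytic gradient can be compared with a numerical one under the abs() parameterization, following the same numDeriv recipe as above; for positive parameters no chain-rule factor intervenes.

    # Both should be near zero, and near-identical, at the BFGS solution
    library(numDeriv)
    grad(function(u) KannistoLik2(c(a = u[1], b = u[2]), .Dx, .Exp), BFGSab2$par)
    Kannisto.gr3(BFGSab2$par, .Dx, .Exp)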

This was solved much faster than I was expecting, and I learned more than a couple tricks. Thanks Vincent!

Answer

To check if the gradient is correct, you can compare it with a numeric approximation:

    library(numDeriv)
    grad( function(u) KannistoLik1( c(a=u[1], b=u[2]), .Dx, .Exp ), c(1,1) )
    Kannisto.gr(c(a=1,b=1), .Dx, .Exp)

The signs are wrong: the algorithm does not see any improvement when it moves in this direction, and therefore does not move.
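
A toy reproduction of that failure mode, as a sketch of my own (not part of the original answer):

    # optim() minimizes fn, so gr must be the gradient of that same fn.
    # With the sign flipped, every proposed direction points uphill, the
    # line search finds no improvement, and the start values never move.
    f     <- function(u) sum(u^2)    # toy objective, minimum at c(0, 0)
    g.ok  <- function(u)  2 * u      # correct gradient
    g.bad <- function(u) -2 * u      # sign flipped
    optim(c(1, 1), f, gr = g.ok,  method = "BFGS")$par  # reaches ~c(0, 0)
    optim(c(1, 1), f, gr = g.bad, method = "BFGS")$par  # stays at c(1, 1)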

You can use some computer algebra system (here, Maxima) to do the computations for you:

    display2d: false;
    f(a,b,x) := a * exp(b*x) / ( 1 + a * exp(b*x) );
    l(a,b,d,e,x) := - d * log(f(a,b,x)) + e * f(a,b,x);
    factor(diff(l(exp(a),exp(b),d,e,x),a));
    factor(diff(l(exp(a),exp(b),d,e,x),b));

I just copy and paste the result into R:

    f_gradient <- function(u, .Dx, .Exp, .x. = .5:30.5) {
      a <- u[1]
      b <- u[2]
      x <- .x.
      d <- .Dx
      e <- .Exp
      c(
        sum( (e*exp(exp(b)*x+a)-d*exp(exp(b)*x+a)-d)/(exp(exp(b)*x+a)+1)^2 ),
        sum( exp(b)*x*(e*exp(exp(b)*x+a)-d*exp(exp(b)*x+a)-d)/(exp(exp(b)*x+a)+1)^2 )
      )
    }

    library(numDeriv)
    grad( function(u) KannistoLik1( c(a=u[1], b=u[2]), .Dx, .Exp ), c(1,1) )
    f_gradient(c(a=1,b=1), .Dx, .Exp)  # Identical

If you blindly put the gradient in the optimization, there is a numeric instability problem: the solution given is (Inf,Inf)... To prevent it, you can rescale the gradient (a better workaround would be to use a less explosive transformation than the exponential, to ensure that the parameters remain positive).

    BFGSab <- optim(
      log(c(a = .1, b = .1)),
      fn = KannistoLik1,
      gr = function(...) f_gradient(...) * 1e-3,
      method = "BFGS",
      .Dx = .Dx, .Exp = .Exp
    )
    exp(BFGSab$par)  # Less precise than Nelder-Mead
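
On the "less explosive transformation" point, one possibility (my sketch, not from the answer) is a softplus reparameterization, reusing Kannisto.gr3 from the update above; whether it converges cleanly will still depend on scaling.

    # Sketch under my own assumptions: softplus(u) = log(1 + exp(u)) keeps
    # a and b positive but grows only linearly, unlike exp().
    softplus <- function(u) log1p(exp(u))
    sigmoid  <- function(u) 1 / (1 + exp(-u))  # derivative of softplus

    KannistoLik.sp <- function(pars, .Dx, .Exp, .x. = .5:30.5){
      mu <- KannistoMu(softplus(pars), x = .x.)
      -sum(.Dx * log(mu) - .Exp * mu, na.rm = TRUE)
    }
    Kannisto.gr.sp <- function(pars, .Dx, .Exp, x = .5:30.5){
      # chain rule: gradient w.r.t. (a, b) times d softplus / d pars
      Kannisto.gr3(softplus(pars), .Dx, .Exp, x) * sigmoid(pars)
    }

    # start where softplus(start) = 0.1; the 1e-4 scale factor is my guess
    SPab <- optim(log(expm1(c(a = .1, b = .1))),
      fn = KannistoLik.sp, gr = function(...) Kannisto.gr.sp(...) * 1e-4,
      method = "BFGS", .Dx = .Dx, .Exp = .Exp)
    softplus(SPab$par)  # back-transform to a and b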
