Why is bam from mgcv slow for some data?

Problem description

I am fitting the same generalized additive model to multiple data sets using the bam function from mgcv. While for most of my data sets the fit completes within a reasonable 10 to 20 minutes, for a few data sets the run takes more than 10 hours. I cannot find any similarities between the slow cases: the final fit is neither exceptionally good nor bad, nor do the data contain any noticeable outliers.

How can I figure out why the fit is so slow for these instances, and how might I be able to speed them up?

My model contains two smooth terms (using a cyclic cubic spline basis) and some additional numerical and factor variables. In total, 300 coefficients (including those for the smooth terms) are estimated. I intentionally keep the number of knots below the information-theoretically optimal number to speed up the fitting process. My data sets contain around 850k rows.

Here is the function call:

bam(
    value
    ~ 0
    + weekday_x
    + weekday
    + time
    + "a couple of factor variables encoding special events"
    + delta:weekday
    + s(share_of_year, k=length(knotsYear), bs="cc")
    + s(share_of_year_x, k=length(knotsYear), bs="cc")
    , knots=list(
      share_of_year=knotsYear
      , share_of_year_x=knotsYear
    )
    , family=quasipoisson()
    , data=data
)

The knot vector knotsYear contains 26 knots.
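
The question does not show how knotsYear is built; as a purely hypothetical illustration, assuming share_of_year is scaled to [0, 1], 26 evenly spaced knots for the cyclic "cc" basis could look like this:

# Hypothetical: 26 evenly spaced knots on [0, 1] for the "cc" basis;
# the actual knotsYear vector is not shown in the question.
knotsYear <- seq(0, 1, length.out = 26)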

This model converges reasonably fast in most cases, but incredibly slowly in a few.

Answer

A most likely cause: "fREML" failure

In a typical model like the one above, without tensor smooths (te or ti), my experience is that the REML iteration fails for some cases.

The canonical bam implementation uses fast.REML.fit. The convergence test of this routine needs a fix, but as Simon has moved on and now focuses more on the discrete method, he is not keen on fixing it. The fixed version is (at the moment) only available in a testing extension pack, the "Zheyuan add-on", supplementary to my PhD thesis. But fast.REML.fit is not that fragile either, and such convergence failure is infrequent; otherwise piles of bug reports would have gotten this issue fixed back in 2012.

The following helps you check for the problem, not fix it.

Let fit be your fitted model that takes 10 hours, and check fit$outer.info. This gives the number of REML iterations taken and convergence information such as the gradient and Hessian. If you see iter = 200, or any message reporting a failure like "step failed", then you know why it takes so long. But if you look at the gradient, most likely you will see that it is almost zero. In other words, the REML iteration has actually converged, but fast.REML.fit fails to detect it.
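
As a minimal sketch of this check (assuming fit holds the model object from the slow run):

# Inspect the REML convergence record of the slow fit.
fit$outer.info
# Look at the iteration count and the reported gradient: a near-zero
# gradient after many iterations means REML effectively converged,
# but the convergence test did not detect it.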

Since you are fitting a GAM, not an AM (an additive model with Gaussian response), there is another P-IRLS (penalized iteratively re-weighted least squares) loop outside the REML iteration. Yes, the whole (canonical) bam estimation is a double-loop nest, called "performance iteration". This can fail, too, but such failure is more intrinsic and may not be surmountable, as "performance iteration" is not guaranteed to converge. So check fit$iter to see whether it is also very big (in the worst case it can be 200). The mgcv manual has a dedicated section, ?gam.convergence, discussing this type of convergence failure, which is the driving reason that "outer iteration" is desirable. However, for a large data set, "outer iteration" is too expensive to implement. So, bear with "performance iteration".
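
The corresponding check, again assuming fit is the slow model:

# Number of P-IRLS ("performance iteration") steps; a value near the
# cap of 200 indicates the outer loop failed to settle.
fit$iter
# The relevant manual section on this type of convergence failure:
?gam.convergence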

mgcv has a "tracing" option. If you set control = gam.control(trace = TRUE) when calling bam, the deviance and an iteration counter are printed to the screen as the "performance iteration" proceeds. This gives you a clear trace of the penalized deviance, so you can inspect whether it is cycling around or trapped at some point. This is more informative than the single iteration number stored in fit$iter.
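
For example, the question's call could be re-run with tracing enabled (a sketch; the special-event factor variables from the original formula are omitted here):

fit <- bam(
    value ~ 0 + weekday_x + weekday + time + delta:weekday
      + s(share_of_year,   k = length(knotsYear), bs = "cc")
      + s(share_of_year_x, k = length(knotsYear), bs = "cc"),
    knots   = list(share_of_year = knotsYear, share_of_year_x = knotsYear),
    family  = quasipoisson(),
    data    = data,
    control = gam.control(trace = TRUE)  # print deviance per iteration
)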

Maybe you want to try the new discrete = TRUE (added in 2015; the paper was formally published in 2017). It uses a new fitting iteration. We are (MUCH) happier with its practical convergence capability than with the old method. When using it, turn on "trace", too. If it fails to converge, think about reporting it, but we need a reproducible case.
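
A sketch of the same call using the discrete method with tracing on (discrete = TRUE works with bam's default method, "fREML"):

fit <- bam(
    value ~ 0 + weekday_x + weekday + time + delta:weekday
      + s(share_of_year,   k = length(knotsYear), bs = "cc")
      + s(share_of_year_x, k = length(knotsYear), bs = "cc"),
    knots    = list(share_of_year = knotsYear, share_of_year_x = knotsYear),
    family   = quasipoisson(),
    data     = data,
    discrete = TRUE,                     # new discretized fitting iteration
    control  = gam.control(trace = TRUE)
)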
