mgcv gam()错误:模型的系数比数据多 [英] mgcv gam() error: model has more coefficients than data

查看:557
本文介绍了mgcv gam()错误:模型的系数比数据多的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在为数据集使用GAM(广义加性模型).该数据集具有32个观测值,其中包含6个预测变量和一个响应变量(即幂). 我正在使用mgcv包的gam()函数来适应模型.每当我尝试拟合模型时,都会收到以下错误消息:

I am using GAM (generalized additive models) for my dataset. This dataset has 32 observations, with 6 predictor variables and a response variable (namely power). I am using gam() function of the mgcv package to fit the models. Whenever, I try to fit a model I do get an error message as:

Error in gam(formula.hh, data = data, na.action = na.exclude,  : 
  Model has more coefficients than data

从此错误消息中,我推断出与观察数相比,我有更多的预测变量.我猜这个错误是在交叉验证过程中产生的.有什么办法可以解决这个错误?

From this error message, I infer that I have more predictor variables as compared to the number of observations. I guess this error is generated during cross-validation procedures. Is there any way to handle this error?

我正在为此使用以下代码,

I am using following code for this,

library(mgcv)
formula.hh <- as.formula(power ~ s(temperature) 
                                + s(prevday1) + s(prevday2)
                                + s(prev_2_hour) + s(prev_instant1))
model <- gam(formula.hh, data = data, na.action = na.exclude)

在这里,我要使用dput()函数附加数据

Here, I am attaching the data with dput() function

> dput(data)
data <- structure(list(power = c(250.615931666667, 252.675878333333, 
1578.209605, 186.636575166667, 1062.07912666667, 1031.481235, 
1584.38902166667, 276.973836666667, 401.620463333333, 1622.50827666667, 
273.825153333333, 1511.37474333333, 291.460865, 215.138178333333, 
247.509348333333, 1140.21383833333, 1680.63441666667, 1742.44168333333, 
592.162706166667, 1610.7307, 615.857495, 1664.13551, 464.973065, 
1956.2482, 1767.94469333333, 1869.02678333333, 1806.731, 1746.3731, 
549.216605, 1425.42390166667, 1900.32575, 1766.18103333333), 
    temperature = c(31, 30, 28, 28, 27, 31, 32, 32, 30.5, 33, 
    33, 30, 32, 24, 30, 26, 28, 32, 34, 25, 32, 33, 35, 36, 36, 
    37, 35, 33, 35, 33, 35, 32), prevday1 = c(NA, 250.615931666667, 
    252.675878333333, 1578.209605, 186.636575166667, 1062.07912666667, 
    1031.481235, 1584.38902166667, 276.973836666667, 401.620463333333, 
    1622.50827666667, 273.825153333333, 1511.37474333333, 291.460865, 
    215.138178333333, 247.509348333333, 1140.21383833333, 1680.63441666667, 
    1742.44168333333, 592.162706166667, 1610.7307, 615.857495, 
    1664.13551, 464.973065, 1956.2482, 1767.94469333333, 1869.02678333333, 
    1806.731, 1746.3731, 549.216605, 1425.42390166667, 1900.32575
    ), prevday2 = c(NA, NA, 250.615931666667, 252.675878333333, 
    1578.209605, 186.636575166667, 1062.07912666667, 1031.481235, 
    1584.38902166667, 276.973836666667, 401.620463333333, 1622.50827666667, 
    273.825153333333, 1511.37474333333, 291.460865, 215.138178333333, 
    247.509348333333, 1140.21383833333, 1680.63441666667, 1742.44168333333, 
    592.162706166667, 1610.7307, 615.857495, 1664.13551, 464.973065, 
    1956.2482, 1767.94469333333, 1869.02678333333, 1806.731, 
    1746.3731, 549.216605, 1425.42390166667), prev_instant1 = c(NA, 
    237.211388333333, 455.932271666667, 367.837349666667, 1230.40137333333, 
    1080.74080166667, 1898.06056666667, 326.103031666667, 302.770571666667, 
    1859.65283333333, 281.700161666667, 1684.32288333333, 291.448878333333, 
    214.838578333333, 254.042623333333, 1380.14074333333, 824.437228333333, 
    1660.46284666667, 268.004111666667, 1715.02763333333, 1853.08503333333, 
    1821.31845, 1173.91945333333, 1859.87353333333, 1887.67635, 
    1760.29563333333, 1876.05421666667, 1743.10665, 366.382048333333, 
    1185.16379, 1713.98534666667, 1746.36006666667), prev_instant2 = c(NA, 
    275.55167, 242.638122833333, 220.635857, 1784.77271666667, 
    1195.45020333333, 590.114391666667, 310.141536666667, 1397.3184605, 
    1747.44398333333, 260.10318, 1521.77355833333, 283.317726666667, 
    206.678135, 231.428693833333, 235.600631666667, 232.455201666667, 
    281.422625, 256.470893333333, 1613.82088333333, 1564.34841666667, 
    1795.03498333333, 1551.64725666667, 1517.69289833333, 1596.66556166667, 
    2767.82433333333, 2949.38005, 328.691775, 389.83789, 1805.71815333333, 
    1153.97645666667, 1752.75968333333), prev_2_hour = c(NA, 
    219.024983, 313.393630708333, 263.748829166667, 931.193606666667, 
    699.399163791667, 754.018962083334, 272.22309625, 595.954508875, 
    1597.21487208333, 512.64361, 1236.42579666667, 281.200373333334, 
    196.983981666666, 230.327737625, 525.483920416666, 391.120302791667, 
    610.101280416667, 247.710625543785, 978.741044166665, 979.658926666667, 
    1189.25306041667, 814.840889166667, 989.059700416665, 1352.2367025, 
    1770.20417833333, 1847.11590666667, 843.191556416666, 363.50806625, 
    904.924465041666, 841.746712500002, 1747.73452958333)), .Names = c("power", 
"temperature", "prevday1", "prevday2", "prev_instant1", "prev_instant2", 
"prev_2_hour"), class = "data.frame", row.names = c(NA, 32L))

推荐答案

此数据集包含32个观测值.

This dataset has 32 observations.

实际上,只有30个两行包含NA.

Actually, only 30 as two rows have NA.

从此错误消息中,我推断出与观察数相比,我有更多的预测变量.

From this error message, I infer that I have more predictor variables as compared to the number of observations.

是的.默认情况下,对于一维平滑器,s()选择基本尺寸(或等级)为10,并提供10个原始参数.在居中约束之后(请参见?identifiability),您会得到少一个参数,但每个平滑度仍然有9个参数.请注意,您有5个平滑!因此,您有45个用于平滑项的参数以及一个截距.这大于您的30个数据.

Yes. By default, the s() choose basis dimension (or rank) to be 10 for 1D smoother, giving 10 raw parameters. After centering constraint (see ?identifiability) you get one fewer parameter, but you still have 9 parameters for each smooth. Note that you have 5 smooths! So you have 45 parameters for smooth terms, plus an intercept. This is greater than your 30 data.

我猜这个错误是在交叉验证过程中产生的.

I guess this error is generated during cross-validation procedures.

不.解释GAM公式并构建模型框架后,便会立即检测到此错误.甚至在进行实际基础构建之前,我们就已经知道什么是n(数据数量)和什么是p(参数数量).

No. This error is detected as soon as GAM formula has been interpreted and model frame been constructed. Even before real basis construction we can already know what is n (number of data) and what is p (number of parameters).

有什么办法可以解决这个错误?

Is there any way to handle this error?

手动减少k而不是使用默认值.但是,对于三次样条曲线,最小k为3.例如,使用s(temperature, bs = 'cr', k = 3).注意我还设置了bs = 'cr'使用自然三次样条,而不是薄板回归样条的默认bs = 'tp'.您当然可以使用它,但是对于一维平滑"cr"是自然的选择.

Reduce k manually rather than using default. However for cubic spline the minimum k is 3. For example, use s(temperature, bs = 'cr', k = 3). Note I have also set bs = 'cr' to use natural cubic spline, not the default bs = 'tp' for thin-plate regression spline. You can use it of course, but for 1D smooth "cr" is a natural choice.

这篇关于mgcv gam()错误:模型的系数比数据多的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆