纵向样条数据的三次样条法? [英] Cubic spline method for longitudinal series data?

查看:210
本文介绍了纵向样条数据的三次样条法?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的串行数据格式如下:

 时间牛奶Animal_ID 
30 25.6 1
31 27.2 1
32 24.4 1
33 17.4 1
34 33.6 1
35 25.4 1
33 29.4 2
34 25.4 2
35 24.7 2
36 27.4 2
37 22.4 2
80 24.6 3
81 24.5 3
82 23.5 3
83 25.5 3
84 24.4 3
85 23.4 3
。 。 。通常,有300只动物在短期的不同时间点都有牛奶记录。但是,如果我们将它们的数据合并在一起,而不关心不同的animal_ID,我们将在milk〜time之间形成一条曲线,如下图所示:

另外,在上图中,我们有1个示例动物的数据,它们是短且高度可变的。我的目的是使每个动物数据变得平滑,但是如果模型允许从整个数据中学习一般模式,那将是很顺利的。我使用以下格式的不同平滑模型(ns,bs,smooth.spline),但它不起作用:

  mod <-lme(牛奶〜bs(时间,df = 3),data = dat,随机=〜1 | Animal_ID)

我希望有人已经解决了这个问题。谢谢
可以从这里访问完整的数据集:
https://www.dropbox.com/s/z9b5teh3su87uu7/dat.txt?dl=0

解决方案

我建议您使用 mgcv 软件包。这是推荐的R包之一,它执行一类称为广义加性混合模型的模型。您只需通过 library(mgcv)加载它。这是一个非常强大的库,可以处理最简单的线性回归模型,广义线性模型,加性模型,广义加性模型以及具有混合效应(固定效应+随机效应)的模型。您可以通过

  ls列出 mgcv 的所有(导出)函数。  package:mgcv)

您可以看到其中有很多。



对于您的特定数据和问题,可以使用具有以下公式的模型:

 模型<-牛奶〜s(时间,bs ='cr',k = 100)+ s(Animal_ID,bs ='re')

mgcv 中, s()是平滑函数的设置,表示为 bs 暗示的样条基础。 cr是三次样条曲线的基础,这正是您想要的。 k 是结数。应该根据数据集中变量 time 的唯一值的数量来选择它。如果将 k 恰好设置为该数字,则最终会得到平滑样条曲线;而任何小于该值的值都表示回归样条。但是,两者都会受到处罚(如果您知道处罚的含义)。我在以下位置读取了您的数据:

  dat<-na.omit(read.csv( data.txt,标头= TRUE))##我将您的数据保存到文件 data.txt 
dat $ Animal_ID<-factor(dat $ Animal_ID)
nrow(dat)## 12624观测值
长度( unique(dat $ time))## 157个独特的时间点
长度(ID<-level(dat $ Animal_ID))## 355头母牛

有157个唯一值,所以我认为 k = 100 是合适的。



对于 Animal_ID (强制为一个因子),我们需要一个随机项的模型项。 re是i.i.d随机效应的特殊类别。由于某些内部矩阵构造原因,它被传递给 bs (因此,这不是平滑函数!)。



现在,要适合GAM模型,您可以调用旧版 gam 或不断发展的 bam (大数据游戏) 。我认为您会使用后者。它们具有与 lm glm 类似的调用约定。例如,您可以执行以下操作:

  fit<-bam(model,data = dat,family = gaussian,离散= TRUE,nthreads = 2)

如您所见, bam 允许通过 nthreads 进行多核并行计算。虽然离散是一项新开发的功能,可以加快矩阵的形成。



由于您要处理时间序列数据,最后您可能会考虑一些时间自相关。 mgcv 允许配置AR1相关,相关系数由 bam 自变量 rho 。但是,您需要一个额外的索引 AR_start 来告诉 mgcv 时间序列如何分解。例如,当到达不同的 Animal_ID 时, AR_start 获得 TRUE 表示时间序列的新部分。有关详细信息,请参见?bam



mgcv 还提供


  1. summary.gam 函数用于模型汇总

  2. gam.check 用于基本模型检查

  3. plot.gam 绘制各个术语的函数

  4. predict.gam (或 predict.bam )以预测新数据。

例如,上述建议模型的摘要为:

 >摘要(适合)

家庭:高斯
链接函数:身份

公式:
milk〜s(time,bs = cr,k = 100)+ s(Animal_ID,bs = re)

参数系数:
估计标准值。误差t值Pr(> | t |)
(截取)26.1950 0.2704 96.89< 2e-16 ***
---
符号。代码:0'***'0.001'**'0.01'*'0.05'。'0.1''1

平滑词的近似含义:
edf Ref.df F p-价值
s(time)10.81 13.67 5.908 1.99e-11 ***
s(Animal_ID)351.43 354.00 136.449< 2e-16 ***
---
签名。代码:0'***'0.001'**'0.01'*'0.05'。'0.1''1

R-sq。(adj)= 0.805解释的偏差= 81.1%
fREML = 29643规模估算= 5.5681 n = 12624

edf (有效自由度)可被视为非线性程度的度量。因此,我们放入 k = 100 ,最后得到 edf = 10.81 这表明样条 s(时间)已受到严重惩罚。您可以查看 s(时间)看起来像是这样:

  plot.gam(fit,page = 1)

请注意,随机效果 s(Animal_ID)也具有平滑,这是特定于母牛的常数。对于随机效应,将返回高斯QQ图。



返回的诊断数字 invisible(gam.check(fit))

看起来不错,所以我认为该模型是可以接受的(我不为您提供模型选择,因此,如果您认为有,请考虑使用更好的模型)。



如果要对 Animal_ID = 26 ,您可以这样做

 已更新<-data.frame(时间= 1:150,Animal_ID = 26)
oo< -predict.gam(fit,newd,type =`link`,se.fit = TRUE)

请注意




  • 您需要在 newd (否则 mgcv 抱怨缺少变量)

  • 因为只有一个样条曲线平滑 s(时间),随机效应项 s(Animal_ID)是每个的常数Animal_ID 。因此可以将 type ='link'用于单个预测。顺便说一下, type ='terms' type ='link'慢。



如果要对多头母牛进行预测,请尝试以下操作:

  pred.ID<-ID [1:10] ##预测前10头牛
newd<-data.frame(时间= rep(1:150,时间= n) ,Animal_ID = factor(rep(pred.ID,每个= 150)))
oo< -predict.bam(fit,newd,type = link,se.fit = TRUE)

请注意




  • 我有在这里使用 predict.bam ,因为现在我们有 150 * 10 = 1500 数据点可以进行预测。另外:我们需要 se.fit = TRUE 。这相当昂贵,因此使用 predict.bam predict.gam 快。特别是,如果您使用 bam(...,离散= TRUE)拟合模型,则可以具有 predict.bam(...,离散= TRUE)。预测过程需要进行与模型拟合相同的矩阵形成步骤(请参见模型拟合中使用的?smoothCon ?PredictMat 如果您希望了解 mgcv 的更多内部结构,请在预测中使用。)

  • 我指定了 Animal_ID 作为因素,因为这是随机效应。



有关 mgcv的更多信息,可以参考库手册。特别检查?mgcv ?gam ?bam ?s






最终更新



尽管我说过我不会在模型部分为您提供帮助,但我认为这种模型更好(它提供了更高的 adj-Rsquared ),并且在意义上也更合理:

 模型<-牛奶〜s(time,bs = 'cr',k = 20)+ s(动物ID,bs ='re')+ s(动物ID,时间,bs ='re')

最后一项是施加随机斜率。这意味着我们假设每头奶牛的乳汁生长/减少模式不同。这是您问题中的更明智的假设。仅具有随机截距的早期模型是不够的。添加此随机斜率后,平滑项 s(time)看起来更平滑。这是一个好兆头,不是一个坏兆头,因为我们想对 s(time)做一些简单的解释,不是吗?比较从两种模型中得到的 s(时间),看看发现了什么。



我也减少了 k = 100 k = 20 。正如我们在之前的拟合中所见,此术语的 edf 约为10,因此 k = 20 足够。


I have a serial data formatted as follows:

time    milk    Animal_ID
30      25.6    1
31      27.2    1
32      24.4    1
33      17.4    1
34      33.6    1
35      25.4    1
33      29.4    2
34      25.4    2
35      24.7    2
36      27.4    2
37      22.4    2
80      24.6    3
81      24.5    3
82      23.5    3
83      25.5    3
84      24.4    3
85      23.4    3
.   .   .

Generally, 300 animals have records of milk in different time points of short period. However, if we join their data together and do not care about different animal_ID, we would have a curve between milk~time like this, the line in figure below: Also, in the above figure, we have data for 1 example animal, they are short and highly variable. My purposed is to smooth each animal data but it would be would if the model allows learning general patter from whole data to be included. I used different smooth model (ns, bs, smooth.spline) with the following format but it just did not work:

mod <- lme(milk ~ bs(time, df=3), data=dat, random = ~1|Animal_ID)

I am hoping if somebody has already dealt with this problem would give me an advice. Thanks The full dataset can be accessed from here: https://www.dropbox.com/s/z9b5teh3su87uu7/dat.txt?dl=0

解决方案

I would suggest you use mgcv package. This is one of the recommended R packages, performing a class of models called generalized additive mixed models. You can simply load it by library(mgcv). This is a very powerful library, which can handle from the simplest linear regression model, to generalized linear models, to additive models, to generalized additive models, as well as models with mixed effects (fixed effects + random effects). You can list all (exported) functions of mgcv via

ls("package:mgcv")

And you can see there are many of them.

For your specific data and problem, you may use a model with formula:

model <- milk ~ s(time, bs = 'cr', k = 100) + s(Animal_ID, bs = 're')

In mgcv, s() is a setup for smooth functions, represented by spline basis implied by bs. "cr" is the cubic spline basis, which is exactly what you want. k is the number of knots. It should be chosen depending on the number of unique values of variable time in your data set. If you set k to exactly this number, you end up with a smoothing spline; while any value smaller than that means a regression spline. However, both will be penalized (if you know what penalization mean). I read your data in:

dat <- na.omit(read.csv("data.txt", header = TRUE))  ## I saved you data into file "data.txt"
dat$Animal_ID <- factor(dat$Animal_ID)
nrow(dat)  ## 12624 observations
length(unique(dat$time))  ## 157 unique time points
length(ID <- levels(dat$Animal_ID))  ## 355 cows

There are 157 unique values, so I reckon k = 100 is possibly appropriate.

For Animal_ID (coerced as a factor), we need a model term for random effect. "re" is a special class for i.i.d random effect. It is passed to bs for some internal matrix construction reason (so this is not a smooth function!).

Now to fit a GAM model, you can call the legacy gam or the constantly developing bam (gam for big data). I think you will use the latter. They have the same calling convention similar to lm and glm. For example, you can do:

fit <- bam(model, data = dat, family = "gaussian", discrete = TRUE, nthreads = 2)

As you can see, bam allows multi-core parallel computation via nthreads. While discrete is a newly developed feature which can speed up matrix formation.

Since you are dealing with time series data, finally you might consider some temporal autocorrelation. mgcv allows configuration of AR1 correlation, whose correlation coefficient is passed by bam argument rho. However, you need an extra index AR_start to tell mgcv how the time series breaks up into pieces. For example, when reaching a different Animal_ID, AR_start get a TRUE to indicate a new segment of time series. See ?bam for details.

mgcv also provides

  1. summary.gam function for model summary
  2. gam.check for basic model checking
  3. plot.gam function for plotting individual terms
  4. predict.gam (or predict.bam) for prediction on new data.

For example, the summary of the above suggested model is:

> summary(fit)

Family: gaussian 
Link function: identity 

Formula:
milk ~ s(time, bs = "cr", k = 100) + s(Animal_ID, bs = "re")

Parametric coefficients:
        Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26.1950     0.2704   96.89   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                edf Ref.df       F  p-value    
s(time)       10.81  13.67   5.908 1.99e-11 ***
s(Animal_ID) 351.43 354.00 136.449  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.805   Deviance explained = 81.1%
fREML =  29643  Scale est. = 5.5681    n = 12624

The edf (effective degree of freedom) may be thought of as a measure of the degree of non-linearity. So we put in k = 100, while ending up with edf = 10.81. This suggest that the spline s(time) has been heavily penalized. You can view the what s(time) looks like by:

plot.gam(fit, page = 1)

Note that the random effect s(Animal_ID) also has a "smooth", that is an cow-specific constant. For random effects, a Gaussian QQ plot will be returned.

The diagnostic figures returned by

invisible(gam.check(fit))

looks OK, so I think the model is acceptable (I am not offering you model selection, so think up a better model if you think there is).

If you want to make prediction for Animal_ID = 26, you may do

newd <- data.frame(time = 1:150, Animal_ID = 26)
oo <- predict.gam(fit, newd, type = `link`, se.fit = TRUE)

Note that

  • You need to include both variables in newd (otherwise mgcv complains missing variable)
  • since you have only one spline smooth s(time), and the random effect term s(Animal_ID) is a constant per Animal_ID. so it is OK to use type = 'link' for individual prediction. By the way, type = 'terms' is slower than type = 'link'.

If you want to make prediction for more than one cows, try something like this:

pred.ID <- ID[1:10]  ## predict first 10 cows
newd <- data.frame (time = rep (1:150, times = n), Animal_ID = factor (rep (pred.ID, each = 150)))
oo <- predict.bam (fit, newd, type = "link", se.fit = TRUE)

Note that

  • I have used predict.bam here, as now we have 150 * 10 = 1500 data points to predict. Plus: we require se.fit = TRUE. This is rather expensive, so use predict.bam is faster than predict.gam. Particularly, if you have fitted your model using bam(..., discrete = TRUE), you can have predict.bam(..., discrete = TRUE). Prediction process goes through the same matrix formation steps as in model fitting (see ?smoothCon used in model fitting and ?PredictMat used in prediction, if you are keen to know more internal structure of mgcv.)
  • I specified Animal_ID as factors, because this is a random effect.

For more on mgcv, you can refer to library manual. Check specially ?mgcv, ?gam, ?bam ?s.


Final update

Though I said that I will not help you with model section, but I think this model is better (it gives higher adj-Rsquared) and is also more reasonable in sense:

model <- milk ~ s(time, bs = 'cr', k = 20) + s(Animal_ID, bs = 're') + s(Animal_ID, time, bs = 're')

The last term is imposing a random slop. This implies that we are assuming that each individual cow has different growing/reducing pattern of milk production. This is a more sensible assumption in your problem. The earlier model with only random intercept is not sufficient. After adding this random slop, the smooth term s(time) looks smoother. This is a good sign not a bad sign, because we want some simple explanation for s(time), don't we? Compare the s(time) you get from both models, and see what you discover.

I have also reduced k = 100 to k = 20. As we saw in previous fit, the edf for this term is about 10, so k = 20 is pretty sufficient.

这篇关于纵向样条数据的三次样条法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆