Adding multiple lag variables using dplyr and for loops
Question
I have time series data that I'm predicting on, so I am creating lag variables to use in my statistical analysis. I'd like a quick way to create multiple variables given specific inputs so that I can easily cross-validate and compare models.
The following is example code that adds 2 lags for 2 different variables (4 total) given a certain category (A, B, C):
# Load dplyr
library(dplyr)
# create day, category, and 2 value vectors
days = 1:9
cats = rep(c('A','B','C'),3)
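set.seed(19)  # the original `set.seed = 19` only assigned a variable, so the output printed below came from an unseeded run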
values1 = round(rnorm(9, 16, 4))
values2 = round(rnorm(9, 16, 16))
# create data frame
data = data.frame(days, cats, values1, values2)
# mutate new lag variables
LagVal = data %>% arrange(days) %>% group_by(cats) %>%
  mutate(LagVal1.1 = lag(values1, 1)) %>%
  mutate(LagVal1.2 = lag(values1, 2)) %>%
  mutate(LagVal2.1 = lag(values2, 1)) %>%
  mutate(LagVal2.2 = lag(values2, 2))
LagVal
days cats values1 values2 LagVal1.1 LagVal1.2 LagVal2.1 LagVal2.2
<int> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 A 16 -10 NA NA NA NA
2 2 B 14 24 NA NA NA NA
3 3 C 16 -6 NA NA NA NA
4 4 A 12 25 16 NA -10 NA
5 5 B 20 14 14 NA 24 NA
6 6 C 18 -5 16 NA -6 NA
7 7 A 21 2 12 16 25 -10
8 8 B 19 5 20 14 14 24
9 9 C 18 -3 18 16 -5 -6
My problem comes in at the # mutate new lag variables step, since I have about a dozen predictor variables that I would potentially want to lag up to 10 times (~13k row dataset), and I don't have the heart to create 120 new variables.
Here is my attempt at writing a function which mutates new variables given the inputs for data (dataset to mutate), variables (the variables you wish to lag), and lags (the number of lags per variable):
MultiMutate = function(data, variables, lags){
  # select the data to be working with
  FuncData = data
  # Loop through desired variables to mutate
  for (i in variables){
    # Loop through number of desired lags
    for (u in 1:lags){
      FuncData = FuncData %>% arrange(days) %>% group_by(cats) %>%
        # Mutate new variable for desired number of lags. Give new variable a name with the lag number appended
        mutate(paste(i, u) = lag(i, u))
    }
  }
  FuncData
}
To be honest I'm just sort of lost on how to get this to work. The ordering of my for-loops and overall logic makes sense, but the way the function takes characters into variables and the overall syntax seems way off. Is there a simple way to fix up this function to get my desired result?
Specifically, I am looking for:
A function like MultiMutate(data = data, variables = c(values1, values2), lags = 2) that would create the exact result of LagVal from above.
Dynamically naming the variables based on the variable and their lag, i.e. value1.1, value1.2, value2.1, value2.2, etc.
Thank you in advance and let me know if you need additional information. If there's a simpler way to get what I'm looking for, then I am all ears.
Recommended answer
You'll have to reach deeper into the tidyverse toolbox to add them all at once. If you nest the data for each value of cats, you can iterate over the nested data frames, iterating the lags over the values* columns in each.
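library(tidyverse)
set.seed(47)
# tibble() replaces the deprecated data_frame()
df <- tibble(days = 1:9,
             cats = rep(c('A','B','C'), 3),
             values1 = round(rnorm(9, 16, 4)),
             values2 = round(rnorm(9, 16, 16)))
# nest everything except cats into a list-column named data (tidyr >= 1.0 syntax)
df %>% nest(data = -cats) %>%
    mutate(lags = map(data, function(dat) {
        # for each column except days, build lag-1 and lag-2 copies with suffixed names
        imap_dfc(dat[-1], ~set_names(map(1:2, lag, x = .x),
                                     paste0(.y, '_lag', 1:2)))
    })) %>%
    unnest(c(data, lags)) %>%
    arrange(days)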
#> # A tibble: 9 x 8
#> cats days values1 values2 values1_lag1 values1_lag2 values2_lag1
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 1 24. -7. NA NA NA
#> 2 B 2 19. 1. NA NA NA
#> 3 C 3 17. 17. NA NA NA
#> 4 A 4 15. 24. 24. NA -7.
#> 5 B 5 16. -13. 19. NA 1.
#> 6 C 6 12. 17. 17. NA 17.
#> 7 A 7 12. 27. 15. 24. 24.
#> 8 B 8 16. 15. 16. 19. -13.
#> 9 C 9 15. 36. 12. 17. 17.
#> # ... with 1 more variable: values2_lag2 <dbl>
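If you would rather keep the MultiMutate(data, variables, lags) signature from the question, here is a minimal sketch using dplyr::across(). It assumes dplyr 1.0 or later, and it assumes variables is passed as a character vector of column names (for example c("values1", "values2")) rather than as bare names. The original attempt fails for two reasons: the left-hand side of = inside mutate() must be a name, not a call like paste(i, u), and lag(i, u) lags the string i instead of the column it names. The helper names below are illustrative only.

```r
library(dplyr)

# Sketch of the asker's MultiMutate; assumes the columns `days` and `cats`
# exist, as in the question, and that `variables` is a character vector.
MultiMutate <- function(data, variables, lags) {
  # One lagging function per lag, named "1", "2", ... so that the
  # names can be used in the new column names via {.fn}
  lag_funs <- lapply(seq_len(lags), function(n) {
    force(n)
    function(x) dplyr::lag(x, n)
  })
  names(lag_funs) <- seq_len(lags)

  data %>%
    arrange(days) %>%
    group_by(cats) %>%
    # Apply every lag function to every requested column
    mutate(across(all_of(variables), lag_funs, .names = "{.col}.{.fn}")) %>%
    ungroup()
}
```

Calling MultiMutate(df, variables = c("values1", "values2"), lags = 2) should add the columns values1.1, values1.2, values2.1 and values2.2, lagged within each value of cats. The names follow the column names rather than the LagVal1.1 pattern from the question; adjust .names if you need a different scheme.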
data.table::shift makes this simpler, as it's vectorized. Naming takes more work than the actual lagging:
library(data.table)
setDT(df)
df[, sapply(1:2, function(x){paste0('values', x, '_lag', 1:2)}) := shift(.SD, 1:2),
by = cats, .SDcols = values1:values2][]
#> days cats values1 values2 values1_lag1 values1_lag2 values2_lag1
#> 1: 1 A 24 -7 NA NA NA
#> 2: 2 B 19 1 NA NA NA
#> 3: 3 C 17 17 NA NA NA
#> 4: 4 A 15 24 24 NA -7
#> 5: 5 B 16 -13 19 NA 1
#> 6: 6 C 12 17 17 NA 17
#> 7: 7 A 12 27 15 24 24
#> 8: 8 B 16 15 16 19 -13
#> 9: 9 C 15 36 12 17 17
#> values2_lag2
#> 1: NA
#> 2: NA
#> 3: NA
#> 4: NA
#> 5: NA
#> 6: NA
#> 7: -7
#> 8: 1
#> 9: 17
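To scale the data.table approach to a dozen variables and ten lags, the naming step can be generalised. The following is only a sketch: lag_by_group, its parameters and the _lag suffix are hypothetical names chosen for illustration, and it assumes the rows are already sorted by day within each group. The new names are built in the same order in which shift() returns its results for a list input (all lags of the first column, then all lags of the next).

```r
library(data.table)

# Hypothetical helper: add lags 1..lags of each column in `variables`,
# computed separately within each level of `group_col`.
lag_by_group <- function(dt, variables, lags, group_col = "cats") {
  dt <- copy(as.data.table(dt))  # work on a copy so the input is not modified
  # e.g. values1_lag1, values1_lag2, values2_lag1, values2_lag2
  new_names <- paste0(rep(variables, each = lags), "_lag", seq_len(lags))
  dt[, (new_names) := shift(.SD, seq_len(lags)),
     by = group_col, .SDcols = variables]
  dt[]
}
```

For example, lag_by_group(df, c("values1", "values2"), lags = 2) should give the same four lag columns as the call above, and widening it to more variables or more lags only changes the arguments.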