使用 dplyr 和 for 循环添加多个滞后变量 [英] Adding multiple lag variables using dplyr and for loops

查看：13 发布时间：2021/12/23 12:29:53 r dplyr

本文介绍了使用 dplyr 和 for 循环添加多个滞后变量的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有预测的时间序列数据，因此我创建了滞后变量以用于我的统计分析.我想要一种在给定特定输入的情况下创建多个变量的快速方法，以便我可以轻松地交叉验证和比较模型.

I have time series data that I'm predicting on, so I am creating lag variables to use in my statistical analysis. I'd like a quick way to create multiple variables given specific inputs so that I can easily cross-validate and compare models.

以下示例代码为给定特定类别(A、B、C)的 2 个不同变量(共 4 个)添加 2 个滞后:

The following is example code that adds 2 lags for 2 different variables (4 total) given a certain category (A, B, C):

# Load dplyr
library(dplyr)

# create day, category, and 2 value vectors
days = 1:9
cats = rep(c('A','B','C'),3)
set.seed = 19
values1 = round(rnorm(9, 16, 4))
values2 = round(rnorm(9, 16, 16))

# create data frame
data = data.frame(days, cats, values1, values2)

# mutate new lag variables 
LagVal = data %>% arrange(days) %>% group_by(cats) %>% 
  mutate(LagVal1.1 = lag(values1, 1)) %>%
  mutate(LagVal1.2 = lag(values1, 2)) %>%
  mutate(LagVal2.1 = lag(values2, 1)) %>%
  mutate(LagVal2.2 = lag(values2, 2))

LagVal

       days   cats values1 values2 LagVal1.1 LagVal1.2 LagVal2.1 LagVal2.2
  <int> <fctr>   <dbl>   <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1     1      A      16     -10        NA        NA        NA        NA
2     2      B      14      24        NA        NA        NA        NA
3     3      C      16      -6        NA        NA        NA        NA
4     4      A      12      25        16        NA       -10        NA
5     5      B      20      14        14        NA        24        NA
6     6      C      18      -5        16        NA        -6        NA
7     7      A      21       2        12        16        25       -10
8     8      B      19       5        20        14        14        24
9     9      C      18      -3        18        16        -5        -6

我的问题出现在 # mutate new lag variables 步骤，因为我有大约十几个预测变量，我可能想要滞后 10 倍(~13k 行数据集)，并且我没有心去创建 120 个新变量.

My problem comes in at the # mutate new lag variables step, since I have about a dozen predictor variables that I would potentially want to lag up to 10 times (~13k row dataset), and I don't have the heart to create 120 new variables.

这是我尝试编写一个函数，该函数根据 data(要变异的数据集)、variables(您希望滞后的变量)的输入来改变新变量，和 lags(每个变量的滞后数):

Here is my attempt at writing a function which mutates new variables given the inputs for data (dataset to mutate), variables (the variables you wish to lag), and lags (the number of lags per variable):

MultiMutate = function(data, variables, lags){
  # select the data to be working with
  FuncData = data
  # Loop through desired variables to mutate
  for (i in variables){
    # Loop through number of desired lags
    for (u in 1:lags){
      FuncData = FuncData %>% arrange(days) %>% group_by(cats) %>%
        # Mutate new variable for desired number of lags. Give new variable a name with the lag number appended
        mutate(paste(i, u) = lag(i, u))
    }
  }
  FuncData
}

老实说，我对如何让它发挥作用有点迷茫.我的 for 循环和整体逻辑的顺序是有道理的，但函数将字符转换为变量的方式和整体语法似乎很遥远.有没有简单的方法来修复这个函数以获得我想要的结果?

To be honest I'm just sort of lost on how to get this to work. The ordering of my for-loops and overall logic makes sense, but the way the function takes characters into variables and the overall syntax seems way off. Is there a simple way to fix up this function to get my desired result?

特别是，我正在寻找:

像 MultiMutate(data = data, variables = c(values1, values2), lags = 2) 这样的函数可以创建 LagVal 的确切结果从上面.

A function like MultiMutate(data = data, variables = c(values1, values2), lags = 2) that would create the exact result of LagVal from above.

根据变量及其滞后动态命名变量.IE.value1.1、value1.2、value2.1、value2.2等

Dynamically naming the variables based on the variable and their lag. I.e. value1.1, value1.2, value2.1, value2.2, etc.

提前致谢，如果您需要更多信息，请告诉我.如果有一种更简单的方法来获得我正在寻找的东西，那么我全神贯注.

Thank you in advance and let me know if you need additional information. If there's a simpler way to get what I'm looking for, then I am all ears.

推荐答案

您必须深入到 tidyverse 工具箱中才能一次性添加所有内容.如果为 cats 的每个值嵌套数据，则可以迭代嵌套的数据框，迭代每个中的 values* 列的滞后.

You'll have to reach deeper into the tidyverse toolbox to add them all at once. If you nest data for each value of cats, you can iterate over the nested data frames, iterating the lags over the values* columns in each.

library(tidyverse)
set.seed(47)

df <- data_frame(days = 1:9,
                 cats = rep(c('A','B','C'),3),
                 values1 = round(rnorm(9, 16, 4)),
                 values2 = round(rnorm(9, 16, 16)))


df %>% nest(-cats) %>% 
    mutate(lags = map(data, function(dat) {
        imap_dfc(dat[-1], ~set_names(map(1:2, lag, x = .x), 
                                     paste0(.y, '_lag', 1:2)))
        })) %>% 
    unnest() %>% 
    arrange(days)
#> # A tibble: 9 x 8
#>   cats   days values1 values2 values1_lag1 values1_lag2 values2_lag1
#>   <chr> <int>   <dbl>   <dbl>        <dbl>        <dbl>        <dbl>
#> 1 A         1     24.     -7.          NA           NA           NA 
#> 2 B         2     19.      1.          NA           NA           NA 
#> 3 C         3     17.     17.          NA           NA           NA 
#> 4 A         4     15.     24.          24.          NA           -7.
#> 5 B         5     16.    -13.          19.          NA            1.
#> 6 C         6     12.     17.          17.          NA           17.
#> 7 A         7     12.     27.          15.          24.          24.
#> 8 B         8     16.     15.          16.          19.         -13.
#> 9 C         9     15.     36.          12.          17.          17.
#> # ... with 1 more variable: values2_lag2 <dbl>

data.table::shift 使这更简单，因为它是矢量化的.命名比实际滞后需要更多的工作:

data.table::shift makes this simpler, as it's vectorized. Naming takes more work than the actual lagging:

library(data.table)

setDT(df)

df[, sapply(1:2, function(x){paste0('values', x, '_lag', 1:2)}) := shift(.SD, 1:2), 
   by = cats, .SDcols = values1:values2][]
#>    days cats values1 values2 values1_lag1 values1_lag2 values2_lag1
#> 1:    1    A      24      -7           NA           NA           NA
#> 2:    2    B      19       1           NA           NA           NA
#> 3:    3    C      17      17           NA           NA           NA
#> 4:    4    A      15      24           24           NA           -7
#> 5:    5    B      16     -13           19           NA            1
#> 6:    6    C      12      17           17           NA           17
#> 7:    7    A      12      27           15           24           24
#> 8:    8    B      16      15           16           19          -13
#> 9:    9    C      15      36           12           17           17
#>    values2_lag2
#> 1:           NA
#> 2:           NA
#> 3:           NA
#> 4:           NA
#> 5:           NA
#> 6:           NA
#> 7:           -7
#> 8:            1
#> 9:           17

这篇关于使用 dplyr 和 for 循环添加多个滞后变量的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

使用 dplyr 和 for 循环添加多个滞后变量 [英] Adding multiple lag variables using dplyr and for loops

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

使用 dplyr 和 for 循环添加多个滞后变量 [英] Adding multiple lag variables using dplyr and for loops

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭