R如何按组滞后数据帧 [英] R How to lag a dataframe by groups

查看:109
本文介绍了R如何按组滞后数据帧的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下数据集:

Name  Year  VarA  VarB Data.1  Data.2
A     2016  L     H    100     101
A     2017  L     H    105     99
A     2018  L     H    103     105
A     2016  L     A    90      95
A     2017  L     A    99      92
A     2018  L     A    102     101

我想按以下分组添加滞后变量:Name,VarA,VarB,这样我的数据看起来像:

I want to add a lagged variable by the grouping: Name, VarA, VarB so that my data would look like:

Name  Year  VarA  VarB Data.1  Data.2  Lg1.Data.1 Lg2.Data.1
A     2016  L     H    100     101     NA        NA
A     2017  L     H    105     99      100       NA
A     2018  L     H    103     105     105       100
A     2016  L     A    90      95      NA        NA
A     2017  L     A    99      92      90        NA
A     2018  L     A    102     101     99        90

我发现以下链接非常有用: debugging:用于为多个列(dplyr)创建多个滞后的功能

I found the following link, which is helpful: debugging: function to create multiple lags for multiple columns (dplyr)

并且正在使用以下代码:

And am using the following code:

df <- df %>% 
  group_by(Name) %>% 
  arrange(Name, VarA, VarB, Year) %>% 
  do(data.frame(., setNames(shift(.[,c(5:6)], 1:2), c(seq(1:8)))))

但是,滞后时间抵消了所有与名称相关联的数据,而不是我想要的分组,因此只有2018年准确地滞后了。

However, the lag offsetting all data associated w/ name, instead of the grouping I want, so only the 2018 years are accurately lagged.

Name  Year  VarA  VarB Data.1  Data.2  Lg1.Data.1 Lg2.Data.1
A     2016  L     H    100     101     NA        NA
A     2017  L     H    105     99      100       NA
A     2018  L     H    103     105     105       100
A     2016  L     A    90      95      103       105
A     2017  L     A    99      92      90        103
A     2018  L     A    102     101     99        90

如何获得每个新重置的滞后时间分组组合(例如Name / VarA / VarB)?

How do I get the lag to reset for each new grouping combination (e.g. Name / VarA / VarB)?

推荐答案

dplyr :: lag 可让您设置要滞后的距离。您可以按所需的任何变量分组-在这种情况下,名称 VarA VarB -在创建滞后变量之前。

dplyr::lag lets you set the distance you want to lag by. You can group by whatever variables you want—in this case, Name, VarA, and VarB—before making your lagged variables.

library(dplyr)

df %>%
  group_by(Name, VarA, VarB) %>%
  mutate(Lg1.Data.1 = lag(Data.1, n = 1), Lg2.Data.1 = lag(Data.1, n = 2))
#> # A tibble: 6 x 8
#> # Groups:   Name, VarA, VarB [2]
#>   Name   Year VarA  VarB  Data.1 Data.2 Lg1.Data.1 Lg2.Data.1
#>   <chr> <dbl> <chr> <chr>  <dbl>  <dbl>      <dbl>      <dbl>
#> 1 A      2016 L     H        100    101         NA         NA
#> 2 A      2017 L     H        105     99        100         NA
#> 3 A      2018 L     H        103    105        105        100
#> 4 A      2016 L     A         90     95         NA         NA
#> 5 A      2017 L     A         99     92         90         NA
#> 6 A      2018 L     A        102    101         99         90

如果您想要一个可扩展到更大滞后的版本,则可以使用一些非标准评估来动态创建新的滞后列。我将使用 purrr :: map 进行此操作,以迭代一组 n 滞后的列表,数据框添加新列,然后将所有数据框连接在一起。 NSE可能有更好的方法,所以希望有人可以对此进行改进。

If you want a version that scales to more lags, you can use some non-standard evaluation to create new lagged columns dynamically. I'll do this with purrr::map to iterate of a set of n to lag by, make a list of data frames with the new columns added, then join all the data frames together. There are probably better NSE ways to do this, so hopefully someone can improve upon it.

我正在编写一些新数据,以期具有更广泛的范围为了显示。在 mutate 内,您可以使用 quo_name 创建列名。

I'm making up some new data, just to have a wider range of years to illustrate. Inside mutate, you can create column names with quo_name.

library(dplyr)
library(purrr)

set.seed(127)
df <- tibble(
  Name = "A", Year = rep(2016:2020, 2), VarA = "L", VarB = rep(c("H", "A"), each = 5),
  Data.1 = sample(1:10, 10, replace = T), Data.2 = sample(1:10, 10, replace = T) 
)

df_list <- purrr::map(1:4, function(i) {
  df %>%
    group_by(Name, VarA, VarB) %>%
    mutate(!!quo_name(paste0("Lag", i)) := dplyr::lag(Data.1, n = i))
})

您不需要保存此列表-我只是在做它来显示其中一个数据框的示例。相反,您可以直接进入 reduce

You don't need to save this list—I'm just doing it to show an example of one of the data frames. You could instead go straight into reduce.

df_list[[3]]
#> # A tibble: 10 x 7
#> # Groups:   Name, VarA, VarB [2]
#>    Name   Year VarA  VarB  Data.1 Data.2  Lag3
#>    <chr> <int> <chr> <chr>  <int>  <int> <int>
#>  1 A      2016 L     H          3      9    NA
#>  2 A      2017 L     H          1      4    NA
#>  3 A      2018 L     H          3      8    NA
#>  4 A      2019 L     H          2      2     3
#>  5 A      2020 L     H          4      5     1
#>  6 A      2016 L     A          8      4    NA
#>  7 A      2017 L     A          6      8    NA
#>  8 A      2018 L     A          3      2    NA
#>  9 A      2019 L     A          8      6     8
#> 10 A      2020 L     A          9      1     6

然后使用 purrr :: reduce 加入列表中的所有数据框。由于每个数据帧中都有相同的列,并且这些列是您要加入的列,因此您可以不必在 inner_join

Then use purrr::reduce to join all the data frames in the list. Since there are columns that are the same in each of the data frames, and those are the ones you want to join by, you can get away with not specifying join-by columns in inner_join.

reduce(df_list, inner_join)
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> # A tibble: 10 x 10
#> # Groups:   Name, VarA, VarB [?]
#>    Name   Year VarA  VarB  Data.1 Data.2  Lag1  Lag2  Lag3  Lag4
#>    <chr> <int> <chr> <chr>  <int>  <int> <int> <int> <int> <int>
#>  1 A      2016 L     H          3      9    NA    NA    NA    NA
#>  2 A      2017 L     H          1      4     3    NA    NA    NA
#>  3 A      2018 L     H          3      8     1     3    NA    NA
#>  4 A      2019 L     H          2      2     3     1     3    NA
#>  5 A      2020 L     H          4      5     2     3     1     3
#>  6 A      2016 L     A          8      4    NA    NA    NA    NA
#>  7 A      2017 L     A          6      8     8    NA    NA    NA
#>  8 A      2018 L     A          3      2     6     8    NA    NA
#>  9 A      2019 L     A          8      6     3     6     8    NA
#> 10 A      2020 L     A          9      1     8     3     6     8

创建于2018-12-07由 reprex软件包(v0.2.1)

Created on 2018-12-07 by the reprex package (v0.2.1)

这篇关于R如何按组滞后数据帧的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆