Recursive function in R with ending point varying by group

Problem description

I wish to use a recursive structure in my dplyr chain that iterates over the number of lags used in certain operations. The problem is that I am not sure how to set its ending point, since it resembles a while loop more than a for loop, which makes me a bit unsure.

Here is some sample data. Groups are not necessarily the same size and are indexed by id.

df <- data.frame(id = c(1, 1, 1, 1, 2, 
                        2, 3, 4, 5, 5, 5), 
                  p201 = c(NA, NA, "001", NA, NA, NA, "001", "001", "001", NA, NA), 
                 V2009 = c(25, 11, 63, 75, 49, 14, 32, 31, 3, 10, 3),
                 ager = c(2.3, 2, 8.1, 12.1, 5.1, 2, 2.9, 2.8, 2,
                          2, 2), 
                 V2007 = c(1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1)
)

I wish to update p201 according to how similar an observation is to its lags within a given group.

This is how I would do it in a first iteration:

library(dplyr)

# First iteration only: compare each row with its immediate predecessor (lag 1)
new <- df %>%
  group_by(id) %>%
  mutate(
    p201 = ifelse(!is.na(p201), p201,
                  ifelse(
                    V2007 == lag(V2007, 1) & 
                    abs(V2009 - lag(V2009, 1)) <= ager,
                    first(na.omit(p201)), p201)))

My question is how to write a recursive function that fits into a dplyr chain and iterates over i in lag(VAR, i). I want i to grow until either of two things happens: there are no more NAs in p201, or all possible lags have been tried in each group. Regarding the latter, note that the number of rows varies by group.

I thought about two possibilities: setting the maximum value of i to the number of rows in the largest group minus 1, or to the number of rows in each group minus 1. I'm not sure which solution is better, nor do I know how to implement either.

Can anyone help?

Here is the desired output:

# A tibble: 11 x 5
# Groups:   id [5]
      id p201  V2009  ager V2007
   <int> <chr> <dbl> <dbl> <dbl>
 1    1 NA       25  2.3      1
 2    1 NA       11  2        1
 3    1 001      63  8.1      1
 4    1 001      75 12.1      1
 5    2 NA       49  5.1      2
 6    2 NA       14  2        2
 7    3 001      32  2.9      1
 8    4 001      31  2.8      2
 9    5 001       3  2        1
10    5 NA       10  2        1
11    5 001       3  2        1

                 

Recommended answer

I don't think what you are describing is really recursive, in that the calculations don't depend on the results of previous iterations. It is, however, fairly complex, and perhaps the best way to fit it into a dplyr pipeline is to declare a function that takes the necessary variables and returns your answer.

Here is a function that does the trick. It uses the split-lapply-merge paradigm to force the calculations to work properly row-wise. It then uses sapply to check, for each row, whether the logical conditions are met in any previous row of the group. If so, it replaces that row's p201 value with the group's first non-NA value:

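multi_condition <- function(id, v1, v2, v3, v4)
{
  # Split the inputs by id and process each group separately
  unlist(lapply(split(data.frame(v1, v2, v3, v4), id), function(x) 
  {
    # Nothing to propagate if the whole group is NA
    if(all(is.na(x$v1))) return(x$v1)
    
    # For each row after the first, test the conditions against all previous rows
    ss <- unlist(c(FALSE, sapply(seq_along(x$v2)[-1], function(i) 
    {
      x$v2[i] %in% x$v2[1:(i - 1)] & any(abs(x$v3[i] - x$v3[1:(i - 1)]) <= x$v4[i])
    })))   
    # Overwrite the flagged rows with the group's first non-NA value
    replace(x$v1, ss, x$v1[!is.na(x$v1)][1])    
  }))
}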
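To make the inner check concrete, here is a minimal base R sketch that runs the same test by hand on the rows of group id == 1 from the sample data. The helper name check_row and the standalone vectors are only for illustration, and it uses vapply and && where the function above uses sapply and &, so treat it as a walk-through rather than the function itself.

```r
# Rows of group id == 1, copied from the sample data
v2007 <- c(1, 1, 1, 1)
v2009 <- c(25, 11, 63, 75)
ager  <- c(2.3, 2, 8.1, 12.1)
p201  <- c(NA, NA, "001", NA)

# Hypothetical helper: does row i match the earlier rows of the group?
check_row <- function(i) {
  prev <- seq_len(i - 1)                         # indices of all previous rows
  v2007[i] %in% v2007[prev] &&                   # same V2007 seen before
    any(abs(v2009[i] - v2009[prev]) <= ager[i])  # V2009 within ager of an earlier row
}

# The first row has no predecessors, so it is never flagged
ss <- c(FALSE, vapply(seq_along(v2007)[-1], check_row, logical(1)))

# Overwrite flagged rows with the group's first non-NA p201
filled <- replace(p201, ss, p201[!is.na(p201)][1])
print(ss)
print(filled)
```

Only the fourth row is flagged here: its V2009 of 75 is within 12.1 of the third row's 63, while rows 2 and 3 are further from every earlier row than their own ager allows.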

So the function itself is complex, but its use is straightforward:

library(dplyr)

df %>%
  group_by(id) %>%
  mutate(p201 = multi_condition(id, p201, V2007, V2009, ager))
#> # A tibble: 11 x 5
#> # Groups:   id [5]
#>       id p201  V2009  ager V2007
#>    <dbl> <chr> <dbl> <dbl> <dbl>
#>  1     1 <NA>     25   2.3     1
#>  2     1 <NA>     11   2       1
#>  3     1 001      63   8.1     1
#>  4     1 001      75  12.1     1
#>  5     2 <NA>     49   5.1     2
#>  6     2 <NA>     14   2       2
#>  7     3 001      32   2.9     1
#>  8     4 001      31   2.8     2
#>  9     5 001       3   2       1
#> 10     5 <NA>     10   2       1
#> 11     5 001       3   2       1

If you prefer a more dplyr-style solution using group_map, with the logic perhaps a little clearer, you could try:

multi_select <- function(df, ...) 
{
  rowwise_logic <- function(i) 
  {
    if(i == 1) return(FALSE)
    j <- 1:(i - 1)
    df$V2007[i] %in% df$V2007[j] & 
    any(abs(df$V2009[i] - df$V2009[j]) <= df$ager[i])
  }
  
  matching_rows <- sapply(seq(nrow(df)), rowwise_logic)  
  df$p201[matching_rows] <- first(na.exclude(df$p201))

  return(df)
}

This would work like this:

df %>% 
  group_by(id) %>%
  group_map(multi_select, .keep = TRUE) %>%
  bind_rows()

Created on 2020-07-15 by the reprex package (v0.3.0)
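
For comparison, the lag-by-lag loop described in the question can be sketched in base R, with the upper bound on i taken per group as the number of rows minus 1. The function name fill_by_lags and its argument names are hypothetical. Unlike the functions above, this version requires both conditions to hold against the same lagged row, so it is an illustration of the stopping rule rather than a drop-in replacement.

```r
# Hypothetical helper: fill NAs in p by comparing each row with its i-th lag,
# for i = 1, 2, ..., n - 1, stopping early once no NA is left.
fill_by_lags <- function(p, grp, val, tol) {
  n <- length(p)
  fill <- p[!is.na(p)][1]            # first non-NA value (NA if there is none)
  if (is.na(fill)) return(p)         # nothing to propagate in an all-NA group
  for (i in seq_len(max(n - 1, 0))) {  # largest possible lag with n rows is n - 1
    if (!anyNA(p)) break             # early exit: every row is already filled
    idx <- (i + 1):n                 # rows that actually have an i-th lag
    hit <- is.na(p[idx]) &
      grp[idx] == grp[idx - i] &
      abs(val[idx] - val[idx - i]) <= tol[idx]
    p[idx[which(hit)]] <- fill       # which() drops any NA comparisons
  }
  p
}

# Rows of group id == 5 from the sample data: the match only appears at lag 2
res5 <- fill_by_lags(c("001", NA, NA), c(1, 1, 1), c(3, 10, 3), c(2, 2, 2))
# Rows of group id == 1: the match appears at lag 1
res1 <- fill_by_lags(c(NA, NA, "001", NA), c(1, 1, 1, 1),
                     c(25, 11, 63, 75), c(2.3, 2, 8.1, 12.1))
print(res5)
print(res1)
```

Because it takes plain vectors, a function like this could presumably be called inside mutate after group_by(id), for example as fill_by_lags(p201, V2007, V2009, ager), so that n is the size of each group and the bound adapts automatically.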
