如何根据特定条件优化与观测值匹配的函数 [英] How to optimize a function that matches observations according to certain criteria

查看:68
本文介绍了如何根据特定条件优化与观测值匹配的函数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找一种对给定数据帧进行操作的更有效方法.

I am looking for a more efficient way of doing an operation with a given dataframe.

 library(purrr)
 library(dplyr) 

这是分步说明:

首先,有一个函数possible_matches,对于df中的每个观察值i,给出可能与i匹配的行的索引,这些索引将在下一步中使用:

First, there is the function possible_matches, that for each observation i in df, gives the index of rows that are possibly matchable to i, which are going to be used on the next step:

 possible_matches <-  function(i, df) {
  k1 <- df$j[df$id_0 == df$id_0[i]]
  j2 <- setdiff(df$j, k1)
  k2 <- map(j2, ~ df$j[df$id_0[.] == df$id_0])
  k3 <- map(k2, ~ map(.x,  ~ df$j[df$Year[k1] == df$Year[.] & 
                                    df$Quarter[k1] == df$Quarter[.]]
  ) %>% unlist(.))
  k4 <- map(k3,  ~ length(.) == 0) %>% unlist()
  j2[k4]
}
  

基本上,它将具有相同id的所有行带到i,然后根据某些条件将某些行过滤掉.此函数在函数match1中使用,该函数循环遍历possible_matches给定的所有行,并根据一些其他条件(此处简化)过滤掉更多的行:

Basically, it takes all rows with the same id to i, and then filter some out according to some criteria. This function is used inside function match1, which loops through all rows given by possible_matches, filtering out more of them according to some other criteria (simplified here):

 match_1 <-  function(i, df) {
j <- possible_matches(i, df) 
if (is_empty(j)) {
  out <- i
} else {
  
  g1 <-
    abs(df$V2009[i] - df$V2009[j]) <= 5
     out <- ifelse(!g1, i, j[g1])    
}
return(out)
} 

由于match1可能每行返回多个观察值,因此我必须尽可能地将所有配对的结果分组.我通过定义以下内容来做到这一点:

Since match1 possibly returns multiple observations per row, I have to try to group all paired ones as much as I can. I do this by defining:

modes <- function(x, y) {
  ux <- unique(x)
  tab <- tabulate(match(y, ux))
  ux[tab == max(tab)]
}

在函数equalize_indices中运行它,该函数还将df分成几组,这样就不会循环遍历不必要的行:

Running it inside function equalize_indices, which also splits df into groups so that there is no looping through unnecessary rows:

 equalize_indices <- function(df, prev_id) {
    df1 <- df %>%
      group_split()
    
    w <- df1 %>%
      map(~ .x %>%
            nrow() %>%
            seq())
    
    df1 <- map(df1, ~.x %>%
                 mutate(j = row_number())
    )
    x <- map2(w, df1, ~ map(.x, match_1, df = .y))
    
    z <- map(x, function(x){
      map(x, ~ modes(., x) %>%
            min(.))
      })
    
    df3 <- map2(df1, z,  ~.x %>%
                  mutate(index = .y) %>%
                  group_by(index) %>%
                  mutate(index = min({{prev_id}})) %>%
                  select(-j)
    )
    df <- bind_rows(df3)
    return(df)
  }

编辑 最后,这是一些具有预期输出的较大数据:

EDIT Finally, here is some larger data with the expected output:

set.seed(1)

DF <- data.frame(
  UPA = 1,
  Quarter = sample(1:4, 8, replace  = TRUE),
  Year = sample(2010:2015, 8, replace  = TRUE),
  id_0 = sample(2:10, 8, replace  = TRUE),
  V2009 = c(19, 22, 17, 10, 37, 19, 22, 17)
  ) %>%
  group_by(UPA)

  DF %>%
  equalize_indices(prev_id = id_0)

这是我的问题:使用具有更多条件的25k行(约30分钟)的数据帧运行此过程会花费太长时间.为什么是这样?有什么方法可以使过程更快?这需要可扩展到非常大的数据帧.我知道循环可能会花费一些时间,但是通过在equalize_indices中使用group_split,我可以将循环变小.

Here is my question: it takes too long to run this procedure with a data frame with 25k rows (about 30 min) using more conditionalities. Why is this? Is there some way to turn the process faster? This needs to be scalable to very large data frames. I know looping may take time, but by using group_split inside equalize_indices, I can turn the loops smaller.

如何优化此过程? -从根本上讲,这是一个过程,即拆分数据集,消除对每行的不匹配观察,为每行选择最常见的匹配索引,然后再次绑定行.

How can I optimize this procedure? - It basically is a process splitting a dataset, eliminating unmatchable observations to each row, picking the most common matched index for each row, and then binding the rows again.

我什至不确定哪个确切的部分会很费时.

I am not even sure which exact part is so time consuming.

推荐答案

这是一个方法:

首先,我们创建一个新函数,而不是返回整个data.frame,我们只返回索引

First we create a new function that instead of returning an entire data.frame, we just return the index

equalize_indices2 = function(DF) {
  n = nrow(DF)
  if (n == 1L) return(DF$id_0) ## short-circuit; no need to do any comparisons!
  j = seq(n)
  
  ##check equalities up front so we only have to do it once for all rows
  id_0_eq = outer(DF$id_0, DF$id_0, `==`)
  year_quarter_eq = outer(DF$Year, DF$Year, `==`) & outer(DF$Quarter, DF$Quarter, `==`)
  
  ##pre-allocate vector
  ans = vector('integer', n)
  
  ##simplify logic; id_0 == id_0 but none of the other elements are equal (e.g., id_0[1] != id_0[2])
  only_diag_true = sum(id_0_eq) == n
  
  for (i in j) {
    ######replacement of possible matches#######
    if (only_diag_true) { 
      j2 = j[-i]
      tmp = j2[!year_quarter_eq[j2, i]]
    } else {
      ##TODO match expected output of larger dataset. May be due to downstream issues.
      k1 = j[id_0_eq[i, ]]
      j2 = j[-k1]
      # k4 = unlist(Map(function(x) j[!year_quarter_eq[k1, j[id_0_eq[x, ]]]], j2), use.names = FALSE)
      k4 = vapply(j2, function(x) length(j[year_quarter_eq[k1, j[id_0_eq[x, ]]]]) == 0L, NA)
      # k4 = unlist(Map(function(x) j[!year_quarter_eq[k1, x]], k2), use.names = FALSE)
      
      tmp = j2[k4]
    }
    
    #######replacement of match_1##########
    if (!length(tmp)) { ##This happens when nrow(df) == 1L or the year and quarters match...I think
      out = i
    } else {
      g1 = abs(DF$V2009[i] - DF$V2009[tmp]) <= 5
      out = ifelse(!g1, i, tmp[g1])  
    }
    
    ########replacement of modes#######
    ans[i] = if (length(out) == 1L) out else out[which.max(out)]
  } 
  
  ## replacement of df3 call in OP
  index = ave(DF$id_0, ans, FUN = function(x) {x[1L]})
  return(index)
}

然后,我们使用刚刚创建的功能在适当位置修改data.table:

Then, we modify data.table in place using the function we just created:

library(data.table)
setDT(df)
df[, index := equalize_indices2(.SD), by = UPA]
df

##      UPA Quarter  Year  id_0 V2009 index
##    <num>  <char> <num> <int> <num> <int>
## 1:     1       1  2012     1    19     1
## 2:     1       1  2012     2    22     1
## 3:     1       1  2011     3    17     1
## 4:     1       1  2012     4    10     4
## 5:     2       1  2012     5    37     5

主要优点是我们使等式运算符最小化,因为对于每个UPA组,我们仅对每个组而不是对每一行查看一次相等性.另一个好处是,我们不必处理那么多嵌套列表.实际上,这不包括任何列表.

The main benefits are that we minimize equality operators because for each UPA group, we only look at the equalities once per group instead of for each row. The other nice thing is that we do not have to deal with as many nested lists. In fact, this includes no lists.

这仍然是一个WIP,今天晚些时候我将做TODO逻辑行,但这适用于当前数据集.如果您提供更大的数据集,我也将对此进行测试.

This is still a WIP, I am going to do the TODO logic line later today but this works for the current dataset. If you provide a larger dataset, I will also test against that as well.

就性能而言,这比OP快100倍以上.请注意check = FALSE,因为它们返回不同的类(例如,data.table与仅一个data.frame)

As far as performance, this is more than 100x faster than OP. Note check = FALSE because they return different classes (e.g., data.table vs. only a data.frame)

dt = as.data.table(df)
bench::mark(dt_new_fx = dt[, index := equalize_indices2(.SD), by = UPA],
            OP = {df %>%
              equalize_indices(prev_id = id_0)},
            check = FALSE
)

## # A tibble: 2 x 13
##   expression   min median `itr/sec` mem_alloc
##   <bch:expr> <bch> <bch:>     <dbl> <bch:byt>
## 1 dt_new_fx  370us  390us    2439.     32.5KB
## 2 OP          46ms   46ms      21.8    39.5KB

这篇关于如何根据特定条件优化与观测值匹配的函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆