How to optimize a function that matches observations according to certain criteria
Question
I am looking for a more efficient way of doing an operation with a given dataframe.
library(purrr)
library(dplyr)
Here is the step-by-step description:
First, there is the function possible_matches, which, for each observation i in df, gives the indices of the rows that are possibly matchable to i; these are used in the next step:
possible_matches <- function(i, df) {
  # rows (values of j) that share row i's id_0
  k1 <- df$j[df$id_0 == df$id_0[i]]
  # candidate rows: every row with a different id_0
  j2 <- setdiff(df$j, k1)
  # for each candidate, the rows sharing that candidate's id_0
  k2 <- map(j2, ~ df$j[df$id_0[.] == df$id_0])
  # rows whose Year/Quarter clash with row i's id_0 group
  k3 <- map(k2, ~ map(.x, ~ df$j[df$Year[k1] == df$Year[.] &
                                   df$Quarter[k1] == df$Quarter[.]]) %>%
              unlist(.))
  # keep only the candidates with no such clash
  k4 <- map(k3, ~ length(.) == 0) %>% unlist()
  j2[k4]
}
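For illustration only (this toy data frame is not from the post; possible_matches() just needs the columns j, id_0, Year, and Quarter), a call could look like this:

toy <- data.frame(
  j = 1:3,
  id_0 = c(1, 2, 2),
  Year = c(2012, 2012, 2011),
  Quarter = c(1, 2, 1)
)
# rows 2 and 3 have a different id_0 than row 1, and their id_0 group never
# shares row 1's Year/Quarter, so both are returned as possible matches
possible_matches(1, toy)
#> [1] 2 3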
Basically, it takes all rows with the same id_0 as row i and then filters some of them out according to certain criteria. This function is used inside the function match_1, which loops through all rows given by possible_matches, filtering out more of them according to some other criteria (simplified here):
match_1 <- function(i, df) {
  j <- possible_matches(i, df)
  if (is_empty(j)) {
    # no possible matches: row i keeps its own index
    out <- i
  } else {
    # keep only candidates whose V2009 is within 5 of row i's V2009
    g1 <- abs(df$V2009[i] - df$V2009[j]) <= 5
    out <- ifelse(!g1, i, j[g1])
  }
  return(out)
}
Since match_1 can return multiple observations per row, I have to group all paired observations as much as I can. I do this by defining:
modes <- function(x, y) {
  # value(s) of x that occur most frequently in y
  ux <- unique(x)
  tab <- tabulate(match(y, ux))
  ux[tab == max(tab)]
}
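As a toy illustration (not from the post), modes() returns the value(s) of its first argument that occur most often in its second argument:

# 5 appears twice in y, 3 never and 7 once, so 5 is the mode
modes(x = c(3, 5, 7), y = c(5, 5, 7))
#> [1] 5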
It is run inside the function equalize_indices, which also splits df into groups so that there is no looping through unnecessary rows:
equalize_indices <- function(df, prev_id) {
  # split the (already grouped) data frame into one tibble per group
  df1 <- df %>%
    group_split()
  # within-group row indices
  w <- df1 %>%
    map(~ .x %>%
          nrow() %>%
          seq())
  # add the row number j that possible_matches() relies on
  df1 <- map(df1, ~ .x %>%
               mutate(j = row_number()))
  # run match_1 for every row of every group
  x <- map2(w, df1, ~ map(.x, match_1, df = .y))
  # for each row, pick the most common matched index (the smallest on ties)
  z <- map(x, function(x) {
    map(x, ~ modes(., x) %>%
          min(.))
  })
  # attach the index and collapse it to the minimum prev_id per match group
  df3 <- map2(df1, z, ~ .x %>%
                mutate(index = .y) %>%
                group_by(index) %>%
                mutate(index = min({{ prev_id }})) %>%
                select(-j))
  df <- bind_rows(df3)
  return(df)
}
EDIT: Finally, here is some larger data with the expected output:
set.seed(1)
DF <- data.frame(
  UPA = 1,
  Quarter = sample(1:4, 8, replace = TRUE),
  Year = sample(2010:2015, 8, replace = TRUE),
  id_0 = sample(2:10, 8, replace = TRUE),
  V2009 = c(19, 22, 17, 10, 37, 19, 22, 17)
) %>%
  group_by(UPA)

DF %>%
  equalize_indices(prev_id = id_0)
Here is my question: it takes too long (about 30 minutes) to run this procedure on a data frame with 25k rows when more conditions are used. Why is this? Is there some way to make the process faster? It needs to be scalable to very large data frames. I know the loops may take time, but by using group_split inside equalize_indices I keep them small.
How can I optimize this procedure? It is basically a process of splitting a dataset, eliminating the observations that cannot match each row, picking the most common matched index for each row, and then binding the rows back together.
I am not even sure which exact part is so time consuming.
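(A quick way to check this, not in the original post, is to profile a single run with base R's Rprof and look at where the self time goes; the file name below is just an illustrative choice.)

Rprof("profile.out")
DF %>% equalize_indices(prev_id = id_0)
Rprof(NULL)
summaryRprof("profile.out")$by.self  # functions ranked by time spent in their own code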
Answer
Here is a data.table approach:
First we create a new function that, instead of returning an entire data.frame, just returns the index:
equalize_indices2 = function(DF) {
  n = nrow(DF)
  if (n == 1L) return(DF$id_0) ## short-circuit; no need to do any comparisons!
  j = seq(n)
  ## check equalities up front so we only have to do it once for all rows
  id_0_eq = outer(DF$id_0, DF$id_0, `==`)
  year_quarter_eq = outer(DF$Year, DF$Year, `==`) & outer(DF$Quarter, DF$Quarter, `==`)
  ## pre-allocate vector
  ans = vector('integer', n)
  ## simplify logic; id_0 == id_0 but none of the other elements are equal (e.g., id_0[1] != id_0[2])
  only_diag_true = sum(id_0_eq) == n
  for (i in j) {
    ###### replacement of possible_matches #######
    if (only_diag_true) {
      j2 = j[-i]
      tmp = j2[!year_quarter_eq[j2, i]]
    } else {
      ## TODO match expected output of larger dataset. May be due to downstream issues.
      k1 = j[id_0_eq[i, ]]
      j2 = j[-k1]
      # k4 = unlist(Map(function(x) j[!year_quarter_eq[k1, j[id_0_eq[x, ]]]], j2), use.names = FALSE)
      k4 = vapply(j2, function(x) length(j[year_quarter_eq[k1, j[id_0_eq[x, ]]]]) == 0L, NA)
      # k4 = unlist(Map(function(x) j[!year_quarter_eq[k1, x]], k2), use.names = FALSE)
      tmp = j2[k4]
    }
    ####### replacement of match_1 ##########
    if (!length(tmp)) { ## This happens when nrow(df) == 1L or the year and quarters match...I think
      out = i
    } else {
      g1 = abs(DF$V2009[i] - DF$V2009[tmp]) <= 5
      out = ifelse(!g1, i, tmp[g1])
    }
    ######## replacement of modes #######
    ans[i] = if (length(out) == 1L) out else out[which.max(out)]
  }
  ## replacement of df3 call in OP
  index = ave(DF$id_0, ans, FUN = function(x) {x[1L]})
  return(index)
}
Then, we modify the data.table in place using the function we just created:
library(data.table)
setDT(df)
df[, index := equalize_indices2(.SD), by = UPA]
df
## UPA Quarter Year id_0 V2009 index
## <num> <char> <num> <int> <num> <int>
## 1: 1 1 2012 1 19 1
## 2: 1 1 2012 2 22 1
## 3: 1 1 2011 3 17 1
## 4: 1 1 2012 4 10 4
## 5: 2 1 2012 5 37 5
The main benefit is that we minimize the number of equality comparisons: for each UPA group, we evaluate the equalities once per group instead of once per row. The other nice thing is that we do not have to deal with as many nested lists; in fact, this version uses no lists at all.
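As a toy illustration of that idea (not part of the answer's code), precomputing all pairwise Year/Quarter equalities with outer() gives a logical matrix that the per-row loop can simply index into:

Year    <- c(2012, 2012, 2011)
Quarter <- c(1, 1, 1)
# one pass over all pairs, instead of re-comparing inside every loop iteration
year_quarter_eq <- outer(Year, Year, `==`) & outer(Quarter, Quarter, `==`)
year_quarter_eq[, 2]  # TRUE for rows that share row 2's Year and Quarter
#> [1]  TRUE  TRUE FALSE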
This is still a WIP; I am going to do the TODO logic later today, but this works for the current dataset. If you provide a larger dataset, I will test against that as well.
As far as performance goes, this is more than 100x faster than the OP's version. Note check = FALSE, because the two expressions return different classes (a data.table vs. only a data.frame).
dt = as.data.table(df)
bench::mark(
  dt_new_fx = dt[, index := equalize_indices2(.SD), by = UPA],
  OP = {df %>%
    equalize_indices(prev_id = id_0)},
  check = FALSE
)
## # A tibble: 2 x 13
## expression min median `itr/sec` mem_alloc
## <bch:expr> <bch> <bch:> <dbl> <bch:byt>
## 1 dt_new_fx 370us 390us 2439. 32.5KB
## 2 OP 46ms 46ms 21.8 39.5KB