Efficiently locf by groups in a single R data.table


Question

I have a large, wide data.table (20m rows) keyed by a person ID, with many columns (~150) that contain lots of null values. Each column is a recorded state / attribute that I wish to carry forward for each person. Each person may have anywhere from 10 to 10,000 observations, and there are about 500,000 people in the set. Values from one person must not 'bleed' into the following person, so my solution must respect the person ID column and group appropriately.

For demonstration purposes, here's a very small sample input:

library(data.table)

DT = data.table(
  id=c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  aa=c("A", NA, "B", "C", NA, NA, "D", "E", "F", NA, NA, NA),
  bb=c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  cc=c(1, NA, NA, NA, NA, 4, NA, 5, 6, NA, 7, NA)
)

It looks like this:

    id aa bb cc
 1:  1  A NA  1
 2:  1 NA NA NA
 3:  1  B NA NA
 4:  1  C NA NA
 5:  2 NA NA NA
 6:  2 NA NA  4
 7:  2  D NA NA
 8:  2  E NA  5
 9:  3  F NA  6
10:  3 NA NA NA
11:  3 NA NA  7
12:  3 NA NA NA

My expected output looks like this:

    id aa bb cc
 1:  1  A NA  1
 2:  1  A NA  1
 3:  1  B NA  1
 4:  1  C NA  1
 5:  2 NA NA NA
 6:  2 NA NA  4
 7:  2  D NA  4
 8:  2  E NA  5
 9:  3  F NA  6
10:  3  F NA  6
11:  3  F NA  7
12:  3  F NA  7

I've found a data.table solution that works, but it's terribly slow on my large data sets:

library(zoo)  # na.locf comes from the zoo package
DT[, na.locf(.SD, na.rm=FALSE), by=id]

I've found equivalent solutions using dplyr that are equally slow.

library(dplyr)
GRP = DT %>% group_by(id)
data.table(GRP %>% mutate_each(funs(blah=na.locf(., na.rm=FALSE))))

I was hopeful that I could come up with a rolling 'self' join using the data.table functionality, but I just can't seem to get it right (I suspect I would need to use .N, but I haven't figured it out).

At this point I'm thinking I'll have to write something in Rcpp to efficiently apply the grouped locf.

I'm new to R, but I'm not new to C++, so I'm confident I can do it. I just feel like there should be an efficient way to do this in R using data.table.

Recommended answer

A very simple na.locf can be built by forwarding (cummax) the non-NA indices ((!is.na(x)) * seq_along(x)) and subsetting accordingly:

x = c(1, NA, NA, 6, 4, 5, 4, NA, NA, 2)
x[cummax((!is.na(x)) * seq_along(x))]
# [1] 1 1 1 6 4 5 4 4 4 2

This replicates na.locf with na.rm = TRUE: leading NAs get index 0 and are therefore dropped by the subsetting. To get the na.rm = FALSE behavior, we simply need to make sure the first element in the cummax is TRUE (which is coerced to index 1):

x = c(NA, NA, 1, NA, 2)
x[cummax(c(TRUE, tail((!is.na(x)) * seq_along(x), -1)))]
#[1] NA NA  1  1  2
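
The two one-liners above can be wrapped in a small helper. This is only a sketch: the function name locf_vec and its na.rm argument are illustrative names chosen here, not part of zoo or data.table.

```r
# Hypothetical helper (name and signature are assumptions for illustration).
# Carries the last non-NA value forward using the cummax-of-indices trick.
locf_vec <- function(x, na.rm = FALSE) {
  # Index of each non-NA element, 0 where the element is NA
  idx <- (!is.na(x)) * seq_along(x)
  # Forcing the first index to 1 keeps leading NAs in place (na.rm = FALSE);
  # leaving it at 0 lets the subsetting drop them (na.rm = TRUE)
  if (!na.rm && length(idx) > 0L) idx[1L] <- 1L
  x[cummax(idx)]
}

locf_vec(c(NA, NA, 1, NA, 2))
# [1] NA NA  1  1  2
locf_vec(c(NA, NA, 1, NA, 2), na.rm = TRUE)
# [1] 1 1 2
```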

In this case, we need to take into account not only the non-NA indices but also the indices where the (ordered, or to be ordered) "id" column changes value:

id = c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
c(TRUE, id[-1] != id[-length(id)])
# [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE

Combining the above:

id = c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
x =  c(1,  NA, NA, 6,  4,  5,  4,  NA, NA, 2)

x[cummax(((!is.na(x)) | c(TRUE, id[-1] != id[-length(id)])) * seq_along(x))]
# [1]  1  1 NA  6  4  5  4  4 NA  2

Note that here we OR the first element with TRUE, i.e. make it equal to TRUE, thus getting the na.rm = FALSE behavior.
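
As a minimal sketch, the combined expression can be written as a base R function that takes the values and the group ids. The name locf_grouped is hypothetical, and the function assumes that id is already sorted so that each group occupies a contiguous run of rows.

```r
# Hypothetical helper (name is an assumption for illustration).
# Assumes `id` is sorted, so every group is one contiguous block.
locf_grouped <- function(x, id) {
  stopifnot(length(x) == length(id))
  n <- length(x)
  if (n == 0L) return(x)
  # TRUE at the first row of each group (the first row overall is always TRUE)
  group_start <- c(TRUE, id[-1L] != id[-n])
  # A row keeps its own index if it is non-NA or starts a group;
  # every other row inherits the most recent such index via cummax
  x[cummax((!is.na(x) | group_start) * seq_len(n))]
}

id <- c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
x  <- c(1,  NA, NA, 6,  4,  5,  4,  NA, NA, 2)
locf_grouped(x, id)
# [1]  1  1 NA  6  4  5  4  4 NA  2
```

Because a group's first row always keeps its own index, an NA at the start of a group stays NA instead of inheriting the previous group's last value.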

For this example:

id_change = DT[, c(TRUE, id[-1] != id[-.N])]
DT[, lapply(.SD, function(x) x[cummax(((!is.na(x)) | id_change) * .I)])]
#    id aa bb cc
# 1:  1  A NA  1
# 2:  1  A NA  1
# 3:  1  B NA  1
# 4:  1  C NA  1
# 5:  2 NA NA NA
# 6:  2 NA NA  4
# 7:  2  D NA  4
# 8:  2  E NA  5
# 9:  3  F NA  6
#10:  3  F NA  6
#11:  3  F NA  7
#12:  3  F NA  7
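
For a table of the size described in the question, it may be preferable to fill the columns in place rather than build a new table. The following is a sketch of that variation, not part of the original answer: it applies the same index trick column by column with data.table's set(), and the names fill_cols, rows and idx are illustrative only. It assumes the table may be sorted by id first.

```r
library(data.table)

DT <- data.table(
  id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  aa = c("A", NA, "B", "C", NA, NA, "D", "E", "F", NA, NA, NA),
  bb = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  cc = c(1, NA, NA, NA, NA, 4, NA, 5, 6, NA, 7, NA)
)

setkey(DT, id)  # sort by id so that each group is contiguous

# TRUE at the first row of every id group
id_change <- DT[, c(TRUE, id[-1] != id[-.N])]
rows <- seq_len(nrow(DT))

# Fill every column except the grouping column, one column at a time
fill_cols <- setdiff(names(DT), "id")
for (col in fill_cols) {
  x <- DT[[col]]
  idx <- cummax((!is.na(x) | id_change) * rows)
  set(DT, j = col, value = x[idx])  # overwrite the column by reference
}

DT
```

Whether this is faster than the lapply(.SD, ...) form on a given data set is something to measure rather than assume; the main difference is that DT is modified by reference instead of being returned as a new table.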
