Efficiently locf by groups in a single R data.table


Question

I have a large, wide data.table (20m rows) keyed by a person ID but with lots of columns (~150) that have lots of null values. Each column is a recorded state / attribute that I wish to carry forward for each person. Each person may have anywhere from 10 to 10,000 observations and there are about 500,000 people in the set. Values from one person can not 'bleed' into the following person, so my solution must respect the person ID column and group appropriately.

For demonstration purposes - here's a very small sample input:

library(data.table)

DT = data.table(
  id=c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  aa=c("A", NA, "B", "C", NA, NA, "D", "E", "F", NA, NA, NA),
  bb=c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  cc=c(1, NA, NA, NA, NA, 4, NA, 5, 6, NA, 7, NA)
)

It looks like this:

    id aa bb cc
 1:  1  A NA  1
 2:  1 NA NA NA
 3:  1  B NA NA
 4:  1  C NA NA
 5:  2 NA NA NA
 6:  2 NA NA  4
 7:  2  D NA NA
 8:  2  E NA  5
 9:  3  F NA  6
10:  3 NA NA NA
11:  3 NA NA  7
12:  3 NA NA NA

My expected output looks like this:

    id aa bb cc
 1:  1  A NA  1
 2:  1  A NA  1
 3:  1  B NA  1
 4:  1  C NA  1
 5:  2 NA NA NA
 6:  2 NA NA  4
 7:  2  D NA  4
 8:  2  E NA  5
 9:  3  F NA  6
10:  3  F NA  6
11:  3  F NA  7
12:  3  F NA  7

I've found a data.table solution that works, but it's terribly slow on my large data sets:

library(zoo)  # provides na.locf
DT[, na.locf(.SD, na.rm=FALSE), by=id]



I've found equivalent solutions using dplyr that are equally slow.

GRP = DT %>% group_by(id)
data.table(GRP %>% mutate_each(funs(blah=na.locf(., na.rm=FALSE))))



I was hopeful that I could come up with a rolling 'self' join using the data.table functionality, but I just can't seem to get it right (I suspect I would need to use .N but I just haven't figured it out).

At this point I'm thinking I'll have to write something in Rcpp to efficiently apply the grouped locf.

I'm new to R, but I'm not new to C++ - so I'm confident I can do it. I just feel like there should be an efficient way to do this in R using data.table.

Recommended answer

A very simple na.locf can be built by forwarding (cummax) the non-NA indices ((!is.na(x)) * seq_along(x)) and subsetting accordingly:

x = c(1, NA, NA, 6, 4, 5, 4, NA, NA, 2)
x[cummax((!is.na(x)) * seq_along(x))]
# [1] 1 1 1 6 4 5 4 4 4 2
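
To make the one-liner easier to follow, here is the same computation split into its intermediate steps. The variable names pos and last_seen are only illustrative and do not come from the original answer.

```r
x <- c(1, NA, NA, 6, 4, 5, 4, NA, NA, 2)

# Position of each non-NA element; NA positions become 0.
pos <- (!is.na(x)) * seq_along(x)
pos
# [1]  1  0  0  4  5  6  7  0  0 10

# Running maximum: each slot holds the position of the last non-NA seen so far.
last_seen <- cummax(pos)
last_seen
# [1]  1  1  1  4  5  6  7  7  7 10

# Subsetting by those positions carries the last observation forward.
x[last_seen]
# [1] 1 1 1 6 4 5 4 4 4 2
```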

This replicates na.locf with na.rm = TRUE: a leading NA yields index 0, and subsetting by 0 drops the element. To get na.rm = FALSE behavior, we simply need to make sure the first element passed to cummax is TRUE (that is, 1), so leading NAs point at position 1 and stay NA:

x = c(NA, NA, 1, NA, 2)
x[cummax(c(TRUE, tail((!is.na(x)) * seq_along(x), -1)))]
#[1] NA NA  1  1  2
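
Both variants can be wrapped in one small helper. This is a minimal sketch: locf_vec is a hypothetical name (not a function from zoo or data.table), and its na.rm argument only mimics the behavior described above.

```r
# locf_vec is a hypothetical helper, not part of any package.
locf_vec <- function(x, na.rm = FALSE) {
  # Position of each non-NA element; NA positions become 0.
  pos <- (!is.na(x)) * seq_along(x)
  # Forcing the first position to 1 keeps leading NAs in place.
  if (!na.rm && length(pos) > 0L) pos[1L] <- 1L
  x[cummax(pos)]
}

locf_vec(c(NA, NA, 1, NA, 2))
# [1] NA NA  1  1  2
locf_vec(c(NA, NA, 1, NA, 2), na.rm = TRUE)
# [1] 1 1 2
```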

In this case, we need to take into account not only the non-NA indices but also the indices where the (ordered, or to be ordered) "id" column changes value:

id = c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
c(TRUE, id[-1] != id[-length(id)])
# [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE

Combining the above:

id = c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
x =  c(1,  NA, NA, 6,  4,  5,  4,  NA, NA, 2)

x[cummax(((!is.na(x)) | c(TRUE, id[-1] != id[-length(id)])) * seq_along(x))]
# [1]  1  1 NA  6  4  5  4  4 NA  2
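
As a sanity check, the combined expression can be put in a function and compared with a slower per-group reference. This is a sketch under stated assumptions: locf_grouped is a hypothetical name, and it assumes the rows of each id are contiguous (for example, the data is already sorted by id).

```r
# locf_grouped is a hypothetical helper; it assumes each id forms one contiguous run.
locf_grouped <- function(x, id) {
  n <- length(x)
  # TRUE at the first row of every run of identical ids.
  id_change <- c(TRUE, id[-1L] != id[-n])
  # A group start acts like a non-NA position, so cummax restarts there.
  x[cummax((!is.na(x) | id_change) * seq_len(n))]
}

id <- c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
x  <- c(1,  NA, NA, 6,  4,  5,  4,  NA, NA, 2)

fast <- locf_grouped(x, id)
fast
# [1]  1  1 NA  6  4  5  4  4 NA  2

# Reference: apply the single-vector trick separately within each id.
slow <- ave(x, id, FUN = function(v) {
  pos <- (!is.na(v)) * seq_along(v)
  pos[1L] <- 1L
  v[cummax(pos)]
})

all.equal(fast, slow)
# [1] TRUE
```

The vectorized version makes a single pass over the whole column, whereas the reference calls the function once per group, which is the part that gets expensive with hundreds of thousands of ids.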

Note that here we OR the first element with TRUE, i.e. make it equal to TRUE, thus getting the na.rm = FALSE behavior.

For this example:

id_change = DT[, c(TRUE, id[-1] != id[-.N])]
DT[, lapply(.SD, function(x) x[cummax(((!is.na(x)) | id_change) * .I)])]
#    id aa bb cc
# 1:  1  A NA  1
# 2:  1  A NA  1
# 3:  1  B NA  1
# 4:  1  C NA  1
# 5:  2 NA NA NA
# 6:  2 NA NA  4
# 7:  2  D NA  4
# 8:  2  E NA  5
# 9:  3  F NA  6
#10:  3  F NA  6
#11:  3  F NA  7
#12:  3  F NA  7
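
On a wide table it may be preferable to update the columns by reference with := rather than build a new table. The following is a sketch, not part of the original answer: value_cols and row_idx are illustrative names, and the code assumes that sorting by id (via setkey) leaves each person's rows in the intended chronological order.

```r
library(data.table)

DT <- data.table(
  id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  aa = c("A", NA, "B", "C", NA, NA, "D", "E", "F", NA, NA, NA),
  bb = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  cc = c(1, NA, NA, NA, NA, 4, NA, 5, 6, NA, 7, NA)
)

# Sort by id so that every person's rows are contiguous.
setkey(DT, id)

# Columns to fill: everything except the grouping column (illustrative name).
value_cols <- setdiff(names(DT), "id")

# TRUE at the first row of each id; computed once and reused for every column.
id_change <- DT[, c(TRUE, id[-1L] != id[-.N])]
row_idx   <- seq_len(nrow(DT))

# Overwrite each value column in place with its carried-forward version.
DT[, (value_cols) := lapply(.SD, function(x) {
  x[cummax((!is.na(x) | id_change) * row_idx)]
}), .SDcols = value_cols]
```

After this call, DT should hold the same values as the expected output in the question, with the id column left untouched.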
