Sequence by reducing data.table
Question
require(data.table)
set.seed(333)
t <- data.table(old=1002:2001, dif=sample(1:10,1000, replace=TRUE))
t$new <- t$old + t$dif; t$foo <- rnorm(1000); t$dif <- NULL
i <- data.table(id=1:3, start=sample(1000:1990,3))
> i
   id start
1:  1  1002
2:  2  1744
3:  3  1656
> head(t)
    old  new        foo
1: 1002 1007 -0.7889534
2: 1003 1004  0.3901869
3: 1004 1014  0.7907947
4: 1005 1011  2.0964612
5: 1006 1007  1.1834171
6: 1007 1015  1.1397910
I would like to delete time points from t such that only those rows remain where old[i] = new[i-1], that is, each kept row starts where the previous kept row ends. This gives a continuous sequence of some fixed number of time points. Ideally, this would be done for all id in i simultaneously, where start gives the starting points. For example, if we choose n=5, we should obtain
> head(ans)
id old new foo
1: 1 1002 1007 -0.7889534
2: 1 1007 1015 1.1397910
3: 1 1015 1022 -1.2193670
4: 1 1022 1024 1.2039050
5: 1 1024 1026 0.4388586
6: 2 1744 1750 -0.1368320
where rows 3 to 6 cannot be inferred from the head(t) output shown above, and foo is a stand-in for other variables that need to be kept.
Can this be done efficiently in data.table, for example, using a clever combination of joins?
PS. This question is somewhat similar to an earlier one of mine, but I have modified the situation to make it clearer.
Recommended answer
It seems to me that you need help from graph algorithms. If you want to start with 1002, you can try:
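require(igraph)
# Build a directed graph with one edge old -> new per row of t
g <- graph_from_edgelist(as.matrix(t[,1:2]))
# Keep the rows whose old value is reachable from 1002 along outgoing edges
t[old %in% subcomponent(g,"1002","out")]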
# 1: 1002 1007 -0.78895338
# 2: 1007 1015 1.13979100
# 3: 1015 1022 -1.21936662
# 4: 1022 1024 1.20390482
# 5: 1024 1026 0.43885860
# ---
#191: 1981 1988 -0.22054875
#192: 1988 1989 -0.22812175
#193: 1989 1995 -0.04687776
#194: 1995 2000 2.41349730
#195: 2000 2002 -1.23425666
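To see why following outgoing edges recovers the chain, here is a minimal sketch on a toy graph. The edge list and the names el, g_toy and reach are made up for illustration and are not part of the original data.

```r
library(igraph)

# Toy edge list (made-up values): each row is one old -> new step.
el <- matrix(c("a", "b",
               "b", "c",
               "d", "c",
               "c", "e"), ncol = 2, byrow = TRUE)
g_toy <- graph_from_edgelist(el, directed = TRUE)

# Vertices reachable from "a" when edges are followed forwards only.
# "d" points into the chain but cannot be reached from "a", so it is excluded.
reach <- as_ids(subcomponent(g_toy, "a", mode = "out"))
print(reach)
```

Because every old value in t has exactly one new value, each vertex has at most one outgoing edge, so the reachable set from a start is a single chain rather than a branching tree.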
Of course, you can apply the same subcomponent() call to each start you want and limit the result to the first n rows. For instance, we can lapply over the i$start positions and then rbindlist all the values together, declaring an id column with the i$id values. Something like:
n <- 5
rbindlist(
setNames(lapply(i$start, function(x) t[old %in% subcomponent(g,x,"out")[1:n]]), i$id),
idcol="id")
# id old new foo
# 1: 1 1002 1007 -0.7889534
# 2: 1 1007 1015 1.1397910
# 3: 1 1015 1022 -1.2193666
# 4: 1 1022 1024 1.2039048
# 5: 1 1024 1026 0.4388586
# 6: 2 1744 1750 -0.1368320
# 7: 2 1750 1758 0.3331686
# 8: 2 1758 1763 1.3040357
# 9: 2 1763 1767 -1.1715528
#10: 2 1767 1775 0.2841251
#11: 3 1656 1659 -0.1556208
#12: 3 1659 1663 0.1663042
#13: 3 1663 1669 0.3781835
#14: 3 1669 1670 0.2760948
#15: 3 1670 1675 0.3745026
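Since the question asks about joins, the same chain can also be followed with keyed data.table lookups alone. This is a rough sketch rather than the answer's method: it assumes each old value occurs at most once, uses made-up toy data, and the names steps, starts and follow_chain are hypothetical.

```r
library(data.table)

# Toy data with the same shape as t and i in the question (values are made up).
steps <- data.table(old = 1:10,
                    new = c(3L, 4L, 5L, 6L, 8L, 7L, 9L, 10L, 11L, 12L),
                    foo = letters[1:10])
starts <- data.table(id = 1:2, start = c(1L, 2L))
n <- 3

# Key on old so that each lookup is a binary search, not a full scan.
setkey(steps, old)

# Hypothetical helper: starting at `start`, repeatedly look up the row whose
# old equals the previous row's new, collecting at most n rows.
follow_chain <- function(start, n) {
  rows <- vector("list", n)
  cur <- start
  for (k in seq_len(n)) {
    hit <- steps[.(cur), nomatch = NULL]  # keyed join on old
    if (nrow(hit) == 0L) break            # the chain ends before n steps
    rows[[k]] <- hit
    cur <- hit$new
  }
  rbindlist(rows)
}

# One chain per id; the grouping column id is prepended to each result.
ans <- starts[, follow_chain(start, n), by = id]
print(ans)
```

This performs n lookups per id instead of building a graph over all rows, which may be preferable when n is small relative to the table. The igraph approach is likely simpler when whole chains are needed.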