My function called in data.table j not returning expected results
Question
The indexing problem I described here is resolved with the devel data.table
version 1.9.7.
My question is about understanding what I've done wrong in sending data to and returning from my own function.
As described in the other question I want to keep only the longest continuous segment for each gvkey
and if there are multiple equal length segments, take the most recent.
DT[, fyear.lag := shift(fyear, n=1L, type = "lag"), by = gvkey]
DT[, gap := fyear - fyear.lag]
Here I get the expected results (data.table v1.9.7):
DT[, step.idx := 0] # initialize
DT[gap >=2 , step.idx := 1] # 1's at each multi-year jump
DT[, step.idx := cumsum(step.idx), by = gvkey] # indexes each sequence by firm
DT[ , seq.lengths := .N, by=.(gvkey,step.idx)] # length of each sequence
DT[, keep.seq := 1*(seq.lengths == max(seq.lengths)), by = gvkey] # each firm's longest sequence
DT[keep.seq==1, keep.seq := c(rep(0, (.N-max(seq.lengths))), rep(1, max(seq.lengths))), by = gvkey]
#' expected results:
DT.out <- DT[keep.seq==1] # 23
DT.out[keep.seq==0, .N] # 0
nrow(DT.out)# [1] 149
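To make the step-index logic above easier to follow, here is a minimal base R sketch of the same steps applied to a single hypothetical firm. The toy fyear values are made up for illustration and are not part of the example data; the variable names mirror the column names used above.

```r
# Toy fiscal years for one hypothetical firm: runs of length 3, 2, and 3.
fyear <- c(1990, 1991, 1992, 1995, 1996, 2000, 2001, 2002)

gap <- c(NA, diff(fyear))                        # fyear minus its one-year lag
step <- as.integer(!is.na(gap) & gap >= 2)       # 1 at each multi-year jump
step.idx <- cumsum(step)                         # run index: 0 0 0 1 1 2 2 2
seq.lengths <- ave(step.idx, step.idx, FUN = length)   # 3 3 3 2 2 3 3 3
keep.seq <- as.integer(seq.lengths == max(seq.lengths))

# Two runs tie at length 3, so keep only the last max(seq.lengths) flagged rows.
n.max <- max(seq.lengths)
n.ones <- sum(keep.seq)
keep.seq[keep.seq == 1] <- c(rep(0, n.ones - n.max), rep(1, n.max))

fyear[keep.seq == 1]   # the most recent of the two longest runs
```

The last line should select 2000, 2001 and 2002, the more recent of the two length-3 runs.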
When I try essentially the same process with my own function I get extra keep.seq==0
cases. My question is why don't I get the same result as above from this:
find.seq.keep <- function(g){
  step.idx = rep(0, length(g))
  step.idx[g>=2] = 1
  step.idx = cumsum(step.idx)
  N.seq = length(unique(step.idx))
  seq.lengths = as.vector(unlist(tapply(step.idx, step.idx,
                                        function(x) rep(length(x), length(x)))))
  keep.seq = 1*(seq.lengths == max(seq.lengths))
  if(length(keep.seq[keep.seq == 1]) > max(seq.lengths)){
    N.max = max(seq.lengths)
    N.1s = length(keep.seq[keep.seq==1])
    keep.seq[keep.seq==1] = c(rep(0, (N.1s-N.max)), rep(1, N.max))
  }
  return(as.list(keep.seq))
}
DT[,keep.seqF := find.seq.keep(gap), by = gvkey]
Removal of the rows works but there are some false positives of what to remove:
DT.outF <- DT[keep.seqF==1]
DT.outF[keep.seqF==0, .N] # 0
nrow(DT.outF) # 141 (<149 = nrow(DT.out) !!)
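I'd like to get my own function working so that I can keep using version 1.9.6 (which makes it easier to share with colleagues). At the very least, now that Frank has provided a solution to my problem, I'd like a better grasp of the j argument when I call find.seq.keep.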
=======
**Reproducible Example Data**
DT <- data.table(
gvkey = c(1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681,
1681, 1681, 1681, 1914, 1914, 1914, 1914, 1914, 1914, 1914, 1914, 1914, 1914,
1914, 1914, 1914, 1914, 1914, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2011, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085,
2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085,
2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085,
2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085,
2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2164, 2164, 2164, 2164,
2164, 2164, 2164, 2164, 2164, 2164, 2164, 2164, 2185, 2185, 2185, 2185, 2185,
2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185,
2185, 2185, 2185),
fyear = c(1983, 1984, 1985, 1986, 1987, 1988, 1989, 1997, 1998, 2008, 2009, 2010, 2011,
2012, 2013, 2014, 1983, 1984, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
2001, 2002, 2003, 2004, 2005, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965,
1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978,
1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991,
1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2007, 2008,
2009, 2010, 2011, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960,
1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973,
1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986,
1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,
2000, 2001, 2002, 2003, 2004, 2005, 2006, 2011, 2012, 1978, 1979, 1980, 1981,
1982, 1983, 1984, 1985, 1986, 1989, 1990, 1991, 1970, 1971, 1972, 1973, 1974,
1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,
1988, 1994, 1995))
setkey(DT, gvkey, fyear)
Recommended answer
I'm not sure why your function is not working, but here's an alternative approach:
DT[, g := cumsum( fyear - shift(fyear, fill=fyear[1L]-1L) != 1L ), by=gvkey]
keep = DT[,
.(len = .N), by=.(gvkey, g)][,
.( g = g[tail(which(len == max(len)), 1)]), by=gvkey]
DT.out = DT[keep, on=names(keep)]
DT.out[, .N] # 149, as expected
How it works:
- g is an ID for runs within each gvkey.
- len is the length of each run.
- g[tail(which(len == max(len)), 1)] is the longest run, breaking ties by taking the most recent.
- DT[keep, on=names(keep)] is a merge that subsets DT to the (gvkey, g) found in keep.
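The first three points can be traced on a single group with plain vectors. This is only an illustrative base R sketch on made-up years (the fyear values below are hypothetical, and the final join step is not shown); it uses tabulate() in place of the grouped .N count.

```r
# Hypothetical years for one gvkey: runs of length 3, 2, and 3.
fyear <- c(1990, 1991, 1992, 1995, 1996, 2000, 2001, 2002)

# Run ID: increments whenever the year-over-year difference is not exactly 1.
g <- cumsum(c(TRUE, diff(fyear) != 1))        # 1 1 1 2 2 3 3 3

# Length of each run, indexed by run ID.
len <- tabulate(g)                            # 3 2 3

# Longest run; tail(..., 1) breaks the tie in favor of the most recent one.
longest <- tail(which(len == max(len)), 1)    # 3

fyear[g == longest]
```

Here runs 1 and 3 tie at length 3, and taking the last matching index picks run 3, i.e. the years 2000 through 2002.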
If, for some reason, you wanted a base function to do this...
tag.long.seq = function(x){
g = cumsum(c(1L, diff(x) > 1L))
len = tapply(g, g, FUN = length)
w = tail(which(len == max(len)), 1L)
ave(g, g, FUN = function(z) z[1] == w)
}
DT[, keepem := tag.long.seq(fyear), by=gvkey]
DT[(keepem==1L), .N] # 149 again
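For comparison, a similar flag can be built with base R's rle(). This is a different technique from the function above, offered only as a sketch: the name longest.run.flag is hypothetical, and it assumes x is sorted in increasing order within each group, as setkey(DT, gvkey, fyear) would arrange.

```r
# Hypothetical helper: TRUE for rows in the longest run of consecutive years,
# with ties broken in favor of the most recent run. Assumes x is sorted.
longest.run.flag <- function(x) {
  run.id <- cumsum(c(TRUE, diff(x) != 1))   # new ID whenever the step is not 1
  r <- rle(run.id)                          # r$lengths holds each run's length
  w <- tail(which(r$lengths == max(r$lengths)), 1L)
  rep(seq_along(r$lengths) == w, r$lengths) # expand back to one flag per row
}

flag <- longest.run.flag(c(1990, 1991, 1992, 1995, 1996, 2000, 2001, 2002))
flag
```

Because it returns a plain logical vector of the same length as its input, it could be called per group in the same way as tag.long.seq.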