My function called in data.table j is not returning expected results

Question

The indexing problem I described here is resolved with the devel data.table version 1.9.7.

My question is about understanding what I've done wrong in sending data to and returning from my own function.

As described in the other question, I want to keep only the longest continuous segment for each gvkey; if several segments tie for the longest, I take the most recent one.

 DT[, fyear.lag := shift(fyear, n=1L, type = "lag"), by = gvkey]
 DT[, gap := fyear - fyear.lag]
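To make the lag-and-gap step concrete, here is a small base-R sketch for a single hypothetical firm. The toy `fyear` values are assumptions for illustration, and `c(NA, head(fyear, -1))` stands in for `shift(fyear, n = 1L, type = "lag")` within one group.

```r
# Toy fiscal years for a single firm (hypothetical values)
fyear <- c(1983, 1984, 1985, 1997, 1998)

# Base-R counterpart of a one-step lag: shift down by one, pad with NA
fyear.lag <- c(NA, head(fyear, -1))
gap <- fyear - fyear.lag

# The first row has no predecessor, so its gap is NA
print(gap)

# NA >= 2 evaluates to NA rather than FALSE; which() skips it,
# so only the genuine multi-year jump is reported
print(which(gap >= 2))
```

The NA in the first row of each group is worth keeping in mind, since any later comparison on `gap` has to cope with it.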



Here I get the expected results (data.table v1.9.7):

DT[,         step.idx := 0]    # initialize
DT[gap >=2 , step.idx := 1]    # 1's at each multi-year jump
DT[, step.idx := cumsum(step.idx), by = gvkey] # indexes each sequence by firm
DT[ ,  seq.lengths := .N,  by=.(gvkey,step.idx)]      # length of each sequence
DT[,   keep.seq := 1*(seq.lengths == max(seq.lengths)), by = gvkey]        # each firm's longest sequence
DT[keep.seq==1,  keep.seq := c(rep(0, (.N-max(seq.lengths))), rep(1, max(seq.lengths))), by = gvkey] 

 #' expected results:
 DT.out <- DT[keep.seq==1] # 23
 DT.out[keep.seq==0, .N] # 0 
 nrow(DT.out)#   [1] 149

When I try essentially the same process with my own function I get extra keep.seq==0 cases. My question is why don't I get the same result as above from this:

find.seq.keep <- function(g){
    step.idx = rep(0, length(g))
    step.idx[g>=2] = 1
    step.idx = cumsum(step.idx)
    N.seq = length(unique(step.idx))

    seq.lengths = as.vector(unlist(tapply(step.idx, step.idx,
                     function(x) rep(length(x), length(x)))))
    keep.seq = 1*(seq.lengths == max(seq.lengths))
    if(length(keep.seq[keep.seq == 1]) > max(seq.lengths)){
      N.max = max(seq.lengths)
      N.1s  = length(keep.seq[keep.seq==1])
      keep.seq[keep.seq==1] = c(rep(0, (N.1s-N.max)), rep(1, N.max))
    }
return(as.list(keep.seq))
}
DT[,keep.seqF := find.seq.keep(gap), by = gvkey]

Removing the rows works, but some rows are wrongly flagged for removal:

   DT.outF <- DT[keep.seqF==1]
   DT.outF[keep.seqF==0, .N]  # 0
   nrow(DT.outF)   # 141 (<149 = nrow(DT.out)  !!)

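I would like to get my own function working so that I can still use version 1.9.6 (which makes it easier to share with colleagues). And now that Frank has provided a solution to my problem, I would like a better grasp of the j argument when I call find.seq.keep.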

=======

** Reproducible Example Data ***

DT <- data.table(
   gvkey =  c(1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 1681, 
              1681, 1681, 1681, 1914, 1914, 1914, 1914, 1914, 1914, 1914, 1914, 1914, 1914, 
              1914, 1914, 1914, 1914, 1914, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 
              2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 
              2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 
              2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 
              2011, 2011, 2011, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 
              2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 
              2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 
              2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085,
              2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2085, 2164, 2164, 2164, 2164, 
              2164, 2164, 2164, 2164, 2164, 2164, 2164, 2164, 2185, 2185, 2185, 2185, 2185, 
              2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 2185, 
              2185, 2185, 2185),
   fyear = c(1983, 1984, 1985, 1986, 1987, 1988, 1989, 1997, 1998, 2008, 2009, 2010, 2011, 
             2012, 2013, 2014, 1983, 1984, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 
             2001, 2002, 2003, 2004, 2005, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965,
             1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 
             1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991,
             1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2007, 2008, 
             2009, 2010, 2011, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 
             1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973,
             1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 
             1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 
             2000, 2001, 2002, 2003, 2004, 2005, 2006, 2011, 2012, 1978, 1979, 1980, 1981, 
             1982, 1983, 1984, 1985, 1986, 1989, 1990, 1991, 1970, 1971, 1972, 1973, 1974,
             1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 
             1988, 1994, 1995))

setkey(DT, gvkey, fyear)


Recommended answer

I'm not sure why your function is not working, but here's an alternative approach:

DT[, g := cumsum( fyear - shift(fyear, fill=fyear[1L]-1L) != 1L ), by=gvkey]
keep = DT[, 
  .(len = .N), by=.(gvkey, g)][, 
  .( g = g[tail(which(len == max(len)), 1)]), by=gvkey]

DT.out = DT[keep, on=names(keep)]

DT.out[, .N] # 149, as expected

How it works:

  • g is an ID for runs within each gvkey.
  • len is the length of each run.
  • g[tail(which(len == max(len)), 1)] is the longest, breaking ties by taking the most recent.
  • DT[keep, on=names(keep)] is a merge that subsets DT to the (gvkey, g) pairs found in keep.
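The steps listed above can be traced by hand on one firm. The following is a minimal base-R sketch, not the data.table code itself: the toy `fyear` vector and the helper name `best` are assumptions made for illustration.

```r
# Toy fiscal years for one firm (hypothetical): three runs, two tied at length 3
fyear <- c(1983, 1984, 1985, 1997, 1998, 2008, 2009, 2010)

# Run ID: increments whenever the step from the previous year is not exactly 1
g <- cumsum(c(FALSE, diff(fyear) != 1))

# Length of each run, named by run ID
len <- tapply(g, g, length)

# Position of the last run that attains the maximum length
# (ties are broken in favour of the most recent run)
w <- tail(which(len == max(len)), 1)
best <- as.integer(names(len)[w])

# Years belonging to the selected run
print(fyear[g == best])
```

Here the runs 1983-1985 and 2008-2010 both have length 3, and `tail(..., 1)` picks the later one.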

If, for some reason, you wanted a base function to do this...

tag.long.seq = function(x){
    g    = cumsum(c(1L, diff(x) > 1L))
    len  = tapply(g, g, FUN = length)
    w    = tail(which(len == max(len)), 1L)

    ave(g, g, FUN = function(z) z[1] == w)    
}

DT[, keepem := tag.long.seq(fyear), by=gvkey]

DT[(keepem==1L), .N] # 149 again
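As for why the original find.seq.keep misbehaves, one plausible explanation (an assumption, not something confirmed in this thread) is its final `return(as.list(keep.seq))`. In j, a list is interpreted as a set of columns, one per element, so a list of N length-one flags no longer looks like a single column of N values. The reported 141 rows are consistent with only the first flag of each group being used and recycled: the four firms whose first row is flagged contribute 50 + 58 + 12 + 21 = 141 rows, while the other two contribute none. The base-R sketch below, on hypothetical toy flags, shows the difference in shape.

```r
# Toy flags for one group of five rows (hypothetical values)
keep.seq <- c(0, 0, 1, 1, 1)

# Shape produced by as.list(): five elements, each of length one
as_list <- as.list(keep.seq)
print(length(as_list))
print(lengths(as_list))

# If only the first element were used and recycled over the group
# (the assumed behaviour), every row would receive the same flag
recycled <- rep(as_list[[1]], length(keep.seq))
print(recycled)

# The atomic vector keeps one flag per row
print(identical(unlist(as_list), keep.seq))
```

If this is indeed the cause, ending the function with `return(keep.seq)` instead of `return(as.list(keep.seq))` should give one flag per row.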
