Efficiently counting numbers falling within each range of numbers


Problem description

I'm looking for a faster solution to the problem below. I'll illustrate the problem with a small example and then provide the code to simulate a large dataset, as that's the point of this question. My actual problem size is of list length = 1 million entries.

Say, I've two lists as shown below:

x <- list(c(82, 18), c(35, 50, 15))
y <- list(c(1,2,3,55,90), c(37,38,95))

Properties of x and y:

  • Each element of the list x always sums up to 100.
  • Each element of y will always be sorted and will always be between 1 and 100.

Now, what I'd like is this. Taking x[[1]] and y[[1]], I'd like to find the count of numbers in y[[1]] that are 1) <= 82 and 2) > 82 and <= 100. That would be c(4, 1), because the numbers <= 82 are c(1,2,3,55) and the number between 83 and 100 is c(90). Similarly, for x[[2]] and y[[2]], the answer is c(0, 2, 1). That is, the answer should be:

      [[1]]
      [1] 4 1
      
      [[2]]
      [1] 0 2 1
      

Let me know if this is still unclear.
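
For reference, the counting logic above can be written directly with base R's findInterval; this sketch (my addition, not one of the approaches benchmarked below) just pins down what is being counted:

      # count how many values of yi fall in each interval
      # (0, b1], (b1, b2], ... where b = cumsum(xi)
      count_ranges <- function(xi, yi) {
          breaks <- cumsum(xi)
          bins <- findInterval(yi, breaks, left.open = TRUE) + 1L
          tabulate(bins, nbins = length(breaks))
      }
      mapply(count_ranges, x, y, SIMPLIFY = FALSE)  # list(c(4,1), c(0,2,1)) on the small example

The code to simulate the data at full scale: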

      # simulate n groups; each x[[ix]] has 2-3 parts that sum to exactly 100
      set.seed(1)
      N <- 100
      n <- 1e6
      len <- sample(2:3, n, TRUE)
      
      x <- lapply(seq_len(n), function(ix) {
          probs <- sample(100:1000, len[ix])
          probs <- probs/sum(probs)
      
          # scale to 100 and absorb any rounding error into the first element
          oo <- round(N * probs)
          if (sum(oo) != 100) {
              oo[1] <- oo[1] + (100 - sum(oo))
          }
          oo
      })
      
      # y: for each group, a sorted sample of 1-10 values drawn from 1..100
      require(data.table)
      ss <- sample(1:10, n, TRUE)
      dt <- data.table(val=sample(1:N, sum(ss), TRUE), grp=rep(seq_len(n), ss))
      setkey(dt, grp, val)
      y <- dt[, list(list(val)), by=grp]$V1
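
A couple of quick sanity checks (my addition) confirm the stated properties of the simulated x and y:

      # each x[[i]] sums to exactly 100; each y[[i]] is sorted and within 1..100
      stopifnot(all(vapply(x, sum, numeric(1)) == 100))
      stopifnot(!any(vapply(y, is.unsorted, logical(1))))
      stopifnot(all(unlist(y) >= 1 & unlist(y) <= 100))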
      


What I've done so far:

Using mapply (slow):

I thought of using rank with ties.method="first" and mapply (the obvious choice with 2 lists) and tried this out:

      tt1 <- mapply(y, x, FUN=function(a, b) {
          # with ties.method="first", each boundary's rank in c(a, cumsum(b)) is
          # (# of values in a <= boundary) + its boundary index
          tt <- rank(c(a, cumsum(b)), ties.method="first")[-(1:length(a))]
          c(tt[1] - 1, diff(tt) - 1)
      })
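
To see why this works, here is the trick unrolled on the first pair (a minimal illustration, my addition):

      a <- c(1, 2, 3, 55, 90)    # y[[1]], already sorted
      b <- c(82, 18)             # x[[1]]; boundaries cumsum(b) = 82, 100
      tt <- rank(c(a, cumsum(b)), ties.method="first")[-(1:length(a))]
      tt                         # 5 7  (4 values <= 82, plus index 1; 5 values <= 100, plus index 2)
      c(tt[1] - 1, diff(tt) - 1) # 4 1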
      

Although this works just fine, it takes a lot of time on 1M entries. I think the overhead of computing rank and diff that many times adds to it. This takes 241 seconds!

Therefore, I decided to try and avoid the repeated use of rank and diff by using data.table, sorting with a "group" column. I came up with a longer but much faster solution, shown below:

Using data.table (fast):

      xl <- sapply(x, length)
      yl <- sapply(y, length)
      # flatten both lists into long-format tables with a group id
      xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl), type = "x")
      xdt[, cumval := cumsum(val), by=grp]   # interval boundaries per group
      ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl), type = "y")
      # stack the boundaries (cumval) into the val column alongside the y values,
      # then sort so each boundary lands just after the y values it covers
      tt2 <- rbindlist(list(ydt, xdt[, list(cumval, grp, type)]))
      setkey(tt2, grp, val)
      # record the global row positions of the boundaries, then convert them to
      # within-group positions (sub) by subtracting cumulative group sizes
      xdt.pos <- which(tt2$type == "x")
      tt2[, type.x := 0L][xdt.pos, type.x := xdt.pos]
      tt2 <- tt2[xdt.pos][tt2[, .N, by = grp][, N := cumsum(c(0, head(N, -1)))]][, sub := type.x - N]
      tt2[, val := xdt$val]
      
      # time consuming step: difference the within-group positions to get counts
      tt2 <- tt2[, c(sub[1]-1, sub[2:.N] - sub[1:(.N-1)] - 1), by = grp]
      tt2 <- tt2[, list(list(V1)), by=grp]$V1
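
As a quick consistency check (my addition, assuming tt1 from the mapply version is still around), the two results can be compared element-wise:

      # both approaches should produce the same counts
      stopifnot(all(mapply(function(a, b) all(a == b), tt1, tt2)))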
      

This takes 26 seconds, so it's about 9 times faster. I'm wondering if it's possible to get much more of a speedup, as I'll have to compute this repeatedly on 5-10 such 1-million-entry lists. Thank you.

Recommended answer

Here's another data.table approach. Edit: I added a (dirty?) hack that speeds this up and makes it ~2x faster than the OP's data.table solution.

      # compile the data.tables, set appropriate keys
      xl <- sapply(x, length)
      yl <- sapply(y, length)
      xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl))
      xdt[, cumval := cumsum(val), by=grp]
      ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl))
      
      # hack #0, set key but prevent sorting, since we know data is already sorted
      setattr(ydt, 'sorted', c('grp', 'val'))
      
      # by setting the key in y to val and in x to cumval we can
      # leverage the rolling joins
      setattr(xdt, 'sorted', c('grp', 'cumval'))  # hack #1 set key, but prevent sorting
      vals = xdt[, cumval.copy := cumval][ydt, roll = -Inf]
      
      # hack #2, same deal as above
      # we know that the order of cumval and cumval.copy is the same
      # so let's convince data.table of that
      setattr(vals, 'sorted', c('grp', 'cumval.copy'))
      
      # compute the counts and fill in the missing 0's
      # for when there is no y in the appropriate x interval
      tt2 = vals[, .N, keyby = list(grp, cumval.copy)][xdt][is.na(N), N := 0L]
      
      # convert to list
      tt2 = tt2[order(grp, cumval.copy), list(list(N)), by = grp]$V1
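
The roll = -Inf join is what maps each y value forward to the nearest boundary at or above it. A tiny standalone illustration with toy tables (my addition, not part of the answer):

      # roll = -Inf is "next observation carried backward": each point
      # matches the smallest boundary >= it
      bounds <- data.table(b = c(82, 100), key = "b")
      bounds[, matched := b]     # keep a visible copy, like cumval.copy above
      pts <- data.table(b = c(1, 55, 90), key = "b")
      bounds[pts, roll = -Inf]   # matched = 82, 82, 100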
      
