Efficiently counting numbers falling within each range of numbers
Question
I'm looking for a faster solution to the problem below. I'll illustrate the problem with a small example and then provide code to simulate large data, as that's the point of this question. My actual problem size is a list of length 1 million entries.
Say, I've two lists as shown below:
x <- list(c(82, 18), c(35, 50, 15))
y <- list(c(1,2,3,55,90), c(37,38,95))
Properties of x and y:

- Each element of the list x always sums up to 100.
- Each element of y will always be sorted and will always be between 1 and 100.
Now, what I'd like is this. Taking x[[1]] and y[[1]], I'd like to find the count of numbers in y[[1]] that are 1) <= 82 and 2) > 82 and <= 100. That would be c(4, 1), because the numbers <= 82 are c(1,2,3,55) and the number between 83 and 100 is c(90). Similarly for x[[2]] and y[[2]], the counts are c(0, 2, 1). That is, the answer should be:
[[1]]
[1] 4 1
[[2]]
[1] 0 2 1
Let me know if this is still unclear.
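For reference, the expected output above can be reproduced with a minimal base-R sketch using cut() over the cumulative sums (the count_in_ranges helper is my illustration, not part of the original question):

count_in_ranges <- function(xi, yi) {
    breaks <- c(0, cumsum(xi))          # e.g. c(0, 82, 100) for x[[1]]
    as.vector(table(cut(yi, breaks)))   # counts per (lo, hi] interval
}
mapply(count_in_ranges, x, y, SIMPLIFY = FALSE)
# [[1]]
# [1] 4 1
# [[2]]
# [1] 0 2 1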
set.seed(1)
N <- 100
n <- 1e6
# each x[[i]] gets 2 or 3 parts
len <- sample(2:3, n, TRUE)
x <- lapply(seq_len(n), function(ix) {
    probs <- sample(100:1000, len[ix])
    probs <- probs/sum(probs)
    oo <- round(N * probs)
    if (sum(oo) != 100) {
        # adjust for rounding so each element sums to exactly 100
        oo[1] <- oo[1] + (100 - sum(oo))
    }
    oo
})
require(data.table)
# each y[[i]] gets 1 to 10 values in [1, N], sorted within group by the key
ss <- sample(1:10, n, TRUE)
dt <- data.table(val=sample(1:N, sum(ss), TRUE), grp=rep(seq_len(n), ss))
setkey(dt, grp, val)
y <- dt[, list(list(val)), by=grp]$V1
What I've done so far:

Using mapply (slow):
I thought of using rank with ties.method="first" and mapply (the obvious choice with two lists), and tried this out first:
tt1 <- mapply(y, x, FUN=function(a, b) {
    # positions of cumsum(b) when ranked together with a give, via
    # successive differences, the counts of a's values in each interval
    tt <- rank(c(a, cumsum(b)), ties="first")[-(1:length(a))]
    c(tt[1]-1, diff(tt)-1)
})
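To see why this works, here is a worked trace on the first pair from the small example (my illustration):

a <- c(1, 2, 3, 55, 90)   # y[[1]]
b <- c(82, 18)            # x[[1]]; cumsum(b) is c(82, 100)
rank(c(a, cumsum(b)), ties="first")
# [1] 1 2 3 4 6 5 7
# the breakpoints 82 and 100 land at positions 5 and 7, so the
# counts are c(5 - 1, 7 - 5 - 1) = c(4, 1)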
Although this works just fine, it takes a lot of time on 1M entries. I think the overhead of computing rank and diff that many times adds to it. This takes 241 seconds!
Therefore, I decided to try to overcome the usage of rank and diff by using data.table and sorting with a "group" column. I came up with a longer but much faster solution, shown below:
Using data.table (fast):
xl <- sapply(x, length)
yl <- sapply(y, length)
# flatten both lists, tagging each row with its origin ("x" or "y")
xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl), type = "x")
xdt[, cumval := cumsum(val), by=grp]
ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl), type = "y")
tt2 <- rbindlist(list(ydt, xdt[, list(cumval, grp, type)]))
setkey(tt2, grp, val)
# after sorting, the positions of the "x" rows play the role of rank()
xdt.pos <- which(tt2$type == "x")
tt2[, type.x := 0L][xdt.pos, type.x := xdt.pos]
tt2 <- tt2[xdt.pos][tt2[, .N, by = grp][, N := cumsum(c(0, head(N, -1)))]][, sub := type.x - N]
tt2[, val := xdt$val]
# time consuming step: take differences within each group, as diff() did before
tt2 <- tt2[, c(sub[1]-1, sub[2:.N] - sub[1:(.N-1)] - 1), by = grp]
tt2 <- tt2[, list(list(V1)), by = grp]$V1
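As a quick consistency check (my addition, not from the original post), the two results can be compared; all.equal is safer than identical here since the data.table pipeline may return doubles where rank returned integers:

all.equal(tt1, tt2)  # should be TRUE if both approaches agree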
This takes 26 seconds. So it's about 9 times faster. I'm wondering if it's possible to get much more speedup, as I'll have to compute this recursively on 5-10 such 1-million-entry lists. Thank you.
Answer
Here's another data.table approach. Edit: I added a (dirty?) hack that speeds this up and makes it ~2x faster than the OP's data.table solution.
# build the data.tables and set appropriate keys
xl <- sapply(x, length)
yl <- sapply(y, length)
xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl))
xdt[, cumval := cumsum(val), by=grp]
ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl))
# hack #0, set key but prevent sorting, since we know data is already sorted
setattr(ydt, 'sorted', c('grp', 'val'))
# by setting the key in y to val and in x to cumval we can
# leverage the rolling joins
setattr(xdt, 'sorted', c('grp', 'cumval')) # hack #1 set key, but prevent sorting
vals = xdt[, cumval.copy := cumval][ydt, roll = -Inf]
# hack #2, same deal as above
# we know that the order of cumval and cumval.copy is the same
# so let's convince data.table in that
setattr(vals, 'sorted', c('grp', 'cumval.copy'))
# compute the counts and fill in the missing 0's
# for when there is no y in the appropriate x interval
tt2 = vals[, .N, keyby = list(grp, cumval.copy)][xdt][is.na(N), N := 0L]
# convert to list
tt2 = tt2[order(grp, cumval.copy), list(list(N)), by = grp]$V1
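As a sanity check (a sketch of mine, not from the original answer), rerunning the block above on the question's small inputs should reproduce the expected counts:

x <- list(c(82, 18), c(35, 50, 15))
y <- list(c(1,2,3,55,90), c(37,38,95))
# rerun the block above with these x and y, then:
tt2
# [[1]]
# [1] 4 1
# [[2]]
# [1] 0 2 1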