Optimising sapply() or for(), paste(), to efficiently transform sparse triplet matrix to a libsvm format
Question
I have a piece of R code I want to optimise for speed when working with larger datasets. It currently depends on sapply cycling through a vector of numbers (which correspond to rows of a sparse matrix). The reproducible example below gets at the nub of the problem; it is the three-line function expensive() that chews up the time, and it's obvious why (lots of matching of big vectors against each other, and two nested paste statements for each cycle of the loop). Before I give up and start struggling with doing this bit of the work in C++, is there something I'm missing? Is there a way to vectorise the sapply call that will make it an order of magnitude or three faster?
library(microbenchmark)
# create an example object like a simple_triple_matrix
# number of rows and columns in sparse matrix:
n <- 2000 # real number is about 300,000
ncols <- 1000 # real number is about 80,000
# number of non-zero values, about 10 per row:
nonzerovalues <- n * 10
stm <- data.frame(
  i = sample(1:n, nonzerovalues, replace = TRUE),
  j = sample(1:ncols, nonzerovalues, replace = TRUE),
  v = sample(rpois(nonzerovalues, 5), replace = TRUE)
)
# It seems to save about 3% of time to have i, j and v as objects in their own right
i <- stm$i
j <- stm$j
v <- stm$v
expensive <- function(){
  sapply(1:n, function(k){
    # microbenchmarking suggests quicker to have which() rather than a vector of TRUE and FALSE:
    whichi <- which(i == k)
    paste(paste(j[whichi], v[whichi], sep = ":"), collapse = " ")
  })
}
microbenchmark(expensive())
The output of expensive is a character vector of n elements that looks like this:
[1] "344:5 309:3 880:7 539:6 338:1 898:5 40:1"
[2] "307:3 945:2 949:1 130:4 779:5 173:4 974:7 566:8 337:5 630:6 567:5 750:5 426:5 672:3 248:6 300:7"
[3] "407:5 649:8 507:5 629:5 37:3 601:5 992:3 377:8"
For what it's worth, the motivation is to efficiently write data from a sparse matrix format - either from slam or Matrix, but starting with slam - into libsvm format (which is the format above, but with each row beginning with a number representing a target variable for a support vector machine - omitted in this example as it's not part of the speed problem). I'm trying to improve on the answers to this question. I forked one of the repositories referred to from there and adapted its approach to work with sparse matrices with these functions. The tests show that it works fine; but it doesn't scale up.
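To make the target format concrete, here is a short sketch of turning the character vector produced by expensive() into finished libsvm lines. The label vector y and the output file are my own illustrative additions, since the question deliberately omits the target variable:

```r
res <- expensive()

# y is a hypothetical vector of target labels, one per row of the matrix;
# the question leaves it out because it isn't part of the speed problem.
y <- sample(0:1, n, replace = TRUE)

# libsvm is one line per row: "label j1:v1 j2:v2 ..."
libsvm_lines <- paste(y, res)
writeLines(libsvm_lines, tempfile(fileext = ".libsvm"))
```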
Answer
Use package data.table. Its by, combined with the fast sorting, saves you from finding the indices of equal i values.
res1 <- expensive()
library(data.table)
cheaper <- function() {
  setDT(stm)
  res <- stm[, .(i, jv = paste(j, v, sep = ":"))
             ][, .(res = paste(jv, collapse = " ")), keyby = i][["res"]]
  setDF(stm) # clean-up, which might not be necessary
  res
}
res2 <- cheaper()
all.equal(res1, res2)
#[1] TRUE
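One caveat worth flagging about the all.equal() check above: keyby = i only produces groups for i values that actually occur, so a row of the sparse matrix with no nonzero entries is silently dropped, whereas expensive() returns "" for it. With about ten nonzero values per row that almost never happens in this example, but it can at scale. A sketch of one way to pad the result back out to n rows (cheaper_padded is my own name, not part of the answer):

```r
library(data.table)

cheaper_padded <- function() {
  setDT(stm)
  res <- stm[, .(res = paste(j, v, sep = ":", collapse = " ")), keyby = i
             ][.(seq_len(n)), on = "i"][["res"]]  # join pads missing i with NA
  setDF(stm)
  ifelse(is.na(res), "", res)                     # match expensive()'s "" rows
}
```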
microbenchmark(expensive(),
               cheaper())
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# expensive() 127.63343 135.33921 152.98288 136.13957 138.87969 222.36417 100 b
# cheaper() 15.31835 15.66584 16.16267 15.98363 16.33637 18.35359 100 a
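For completeness, a base-R alternative along the same lines is worth noting (a sketch; cheaper_base is my own name for it, and it needs no extra package): one vectorised paste() builds all the j:v pairs up front, and split() on factor(i, levels = seq_len(n)) does the grouping while keeping empty rows, so the result lines up element-for-element with expensive():

```r
cheaper_base <- function() {
  jv <- paste(j, v, sep = ":")                         # all pairs in one vectorised call
  groups <- split(jv, factor(i, levels = seq_len(n)))  # levels keep rows with no entries
  vapply(groups, paste, character(1), collapse = " ", USE.NAMES = FALSE)
}
```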