Optimising sapply() or for(), paste(), to efficiently transform a sparse triplet matrix to libsvm format


Problem description

I have a piece of R code that I want to optimise for speed when working with larger datasets. It currently depends on sapply() cycling through a vector of numbers (which correspond to rows of a sparse matrix). The reproducible example below gets at the nub of the problem: it is the three-line function expensive() that chews up the time, and it's obvious why (lots of matching of big vectors against each other, and two nested paste() calls in each cycle of the loop). Before I give up and start struggling with doing this bit of the work in C++, is there something I'm missing? Is there a way to vectorise the sapply() call that will make it an order of magnitude or three faster?

library(microbenchmark)

# create an example object like a simple_triplet_matrix (the slam class)
# number of rows and columns in sparse matrix:
n <- 2000 # real number is about 300,000
ncols <- 1000 # real number is about 80,000

# number of non-zero values, about 10 per row:
nonzerovalues <- n * 10

stm <- data.frame(
  i = sample(1:n, nonzerovalues, replace = TRUE),
  j = sample(1:ncols, nonzerovalues, replace = TRUE),
  v = sample(rpois(nonzerovalues, 5), replace = TRUE)
)

# It seems to save about 3% of time to have i, j and v as objects in their own right
i <- stm$i
j <- stm$j
v <- stm$v

expensive <- function(){
  sapply(1:n, function(k){
    # microbenchmarking suggests quicker to have which() rather than a vector of TRUE and FALSE:
    whichi <- which(i == k)
    paste(paste(j[whichi], v[whichi], sep = ":"), collapse = " ")
  })
}

microbenchmark(expensive())

The output of expensive() is a character vector of n elements that looks like this:

 [1] "344:5 309:3 880:7 539:6 338:1 898:5 40:1"                                                                                
 [2] "307:3 945:2 949:1 130:4 779:5 173:4 974:7 566:8 337:5 630:6 567:5 750:5 426:5 672:3 248:6 300:7"                         
 [3] "407:5 649:8 507:5 629:5 37:3 601:5 992:3 377:8" 

For what it's worth, the motivation is to efficiently write data from a sparse matrix format (either from slam or Matrix, but starting with slam) into libsvm format. That is the format above, but with each row beginning with a number representing the target variable for a support vector machine; it is omitted in this example because it is not part of the speed problem. I am trying to improve on the answers to this question. I forked one of the repositories referred to from there and adapted its approach to work with sparse matrices with these functions. The tests show that it works fine, but it doesn't scale up.
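As a minimal sketch of the last step described above (prefixing each row with its target value and writing the result out), the following uses only base R. The feature strings are toy values in the shape produced by expensive(), and the label vector y and the output file are assumptions made up for illustration; neither appears in the question.

```r
# Toy feature strings in the same "index:value" shape as the output above
features <- c("344:5 309:3", "307:3 945:2", "407:5 649:8")

# Hypothetical target labels, one per row (an assumption, not from the question)
y <- c(1, -1, 1)

# A libsvm line is the label, a space, then the index:value pairs
libsvm_lines <- paste(y, features)

# Write to a temporary file and read it back to check the round trip
out_file <- tempfile(fileext = ".libsvm")
writeLines(libsvm_lines, out_file)
print(readLines(out_file))
```

Because paste() is vectorised over both arguments, this step costs a single call regardless of the number of rows; the expensive part remains building the feature strings.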

Recommended answer

Use the data.table package. Its by grouping, combined with its fast sorting, saves you from having to find the indices of equal i values.
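The underlying idea does not depend on data.table and can be sketched in base R on a tiny data set: build every j:v token in one vectorised paste() call, then group the tokens by row index once, instead of scanning i again for every k. The vectors i_toy, j_toy and v_toy are hypothetical values chosen for illustration, and split() stands in here for data.table's grouping; it is not claimed to be as fast.

```r
# Hypothetical toy triplets (not the question's data)
i_toy <- c(2L, 1L, 2L, 1L)   # row index
j_toy <- c(9L, 4L, 3L, 8L)   # column index
v_toy <- c(5L, 6L, 7L, 1L)   # value

# One vectorised paste() over all non-zero entries
tokens <- paste(j_toy, v_toy, sep = ":")

# Group the tokens by row in a single pass
groups <- split(tokens, i_toy)

# Collapse each group into one space-separated string
rows_toy <- unname(vapply(groups, paste, character(1), collapse = " "))
print(rows_toy)
```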

res1 <- expensive()


library(data.table)
cheaper <- function() {
  setDT(stm)
  res <- stm[, .(i, jv = paste(j, v, sep = ":"))
      ][, .(res = paste(jv, collapse = " ")), keyby = i][["res"]]

  setDF(stm) #clean-up which might not be necessary
  res
}

res2 <- cheaper()

all.equal(res1, res2)
#[1] TRUE

microbenchmark(expensive(),
               cheaper())  
#Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval cld
# expensive() 127.63343 135.33921 152.98288 136.13957 138.87969 222.36417   100   b
#   cheaper()  15.31835  15.66584  16.16267  15.98363  16.33637  18.35359   100  a 
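One difference to be aware of: cheaper() returns one element per distinct value of i, whereas expensive() returns "" for any row that has no non-zero entries, so the two results only match when every row from 1 to n occurs in i. The sketch below is one possible variant that keeps a fixed length; the function name to_libsvm_rows() and the toy data are hypothetical. It also sorts by column index within each row, on the assumption that the consumer expects ascending feature indices, as I understand the libsvm README to require.

```r
library(data.table)

# Hypothetical toy triplets: row 2 has no non-zero entries
stm_small <- data.frame(
  i = c(3L, 1L, 1L, 3L),
  j = c(7L, 5L, 2L, 4L),
  v = c(1, 2, 3, 4)
)
n_rows <- 3L

to_libsvm_rows <- function(triplets, n_rows) {
  dt <- as.data.table(triplets)  # a copy, so the caller's data.frame is untouched
  setorder(dt, i, j)             # ascending column index within each row
  grouped <- dt[, .(res = paste(paste(j, v, sep = ":"), collapse = " ")), by = i]
  out <- character(n_rows)       # "" for rows without any entries
  out[grouped$i] <- grouped$res  # place each string at its row index
  out
}

rows <- to_libsvm_rows(stm_small, n_rows)
print(rows)
```

Using as.data.table() rather than setDT() avoids modifying stm by reference, at the cost of one copy, so no setDF() clean-up is needed afterwards.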
