距离对象上的as.matrix非常慢;如何使其更快? [英] as.matrix on a distance object is extremely slow; how to make it faster?

查看:117
本文介绍了距离对象上的as.matrix非常慢;如何使其更快?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我找到了一个R包Rlof,它使用多线程计算距离矩阵,并且做得很好.

I found an R package Rlof which uses multithreading to calculate distance matrices and it does a wonderful job.

但是,函数distmc的输出是矢量而不是矩阵.将as.matrix应用于此"dist"对象,结果要比多线程距离计算要昂贵得多.

However, the output of the function distmc is a vector rather than a matrix. Applying as.matrix to this "dist" object turns out much more expensive than the multi-threaded calculation of distances.

查看功能帮助,有用于打印对角线和上三角的选项,但是我不知道应该在哪里使用它们.

Looking at the function help, the options for printing the diagonal and upper triangle are there, but I don't understand where they are supposed to be used.

是否有某种方式可以节省as.matrix的时间?

Is there a way of saving the time of as.matrix somehow?

可复制的示例:

set.seed(42)
M1 = matrix(rnorm(15000*20), nrow = 15000, ncol =20)
system.time({dA = distmc(M1, method = "euclidean", diag = TRUE,
                         upper = TRUE, p = 2)})
system.time(A = as.matrix(dA))

推荐答案

dist返回什么?

此函数始终返回一个向量,该向量保持整个矩阵的下三角部分(按列). diagupper参数仅影响打印,即stats:::print.dist.此函数可以打印此矢量,就好像它是矩阵一样.但实际上不是.

What does dist return?

This function always returns a vector, holding the lower triangular part (by column) of the full matrix. The diag or upper argument only affects printing, i.e., stats:::print.dist. This function can print this vector as if it were a matrix; but it is really not.

很难有效地处理三角矩阵或进一步使它们在R核中对称.如果矩阵很大,lower.triupper.tri可能会占用内存: R:将矩阵的上三角部分转换为对称矩阵.

It is hard to efficiently work with triangular matrices or to further make them symmetric in R core. lower.tri and upper.tri can be memory consuming if your matrix is large: R: Convert upper triangular part of a matrix to symmetric matrix.

将"dist"对象转换为矩阵的情况更糟.查看stats:::as.matrix.dist:

Converting a "dist" object to a matrix is worse. Look at the code of stats:::as.matrix.dist:

function (x, ...) 
{
    size <- attr(x, "Size")
    df <- matrix(0, size, size)
    df[row(df) > col(df)] <- x
    df <- df + t(df)
    labels <- attr(x, "Labels")
    dimnames(df) <- if (is.null(labels)) 
    list(seq_len(size), seq_len(size))
    else list(labels, labels)
    df
}

使用rowcolt是一场噩梦.最后的"dimnames<-"生成另一个大的临时矩阵对象.当内存成为瓶颈时,一切都会变慢.

The use of row, col and t are a nightmare. And the final "dimnames<-" generates another big temporary matrix object. When memory becomes a bottleneck, everything is slow.

尴尬的是,使用完整矩阵更容易,因此我们需要它.考虑以下示例: R-如何获取行&距离矩阵中匹配元素的列下标.如果我们尝试避免形成完整的矩阵,则操作很棘手.

The awkward thing is that it is easier to work with a full matrix so we want it. Consider this example: R - How to get row & column subscripts of matched elements from a distance matrix. Operations are tricky if we try to avoid forming the complete matrix.

我编写了一个Rcpp函数dist2mat(请参见此答案末尾的"dist2mat.cpp"源文件).

I write an Rcpp function dist2mat (see "dist2mat.cpp" source file in the end of this answer).

该函数有两个输入:"dist"对象x和(整数)缓存阻止因子bf.该函数的工作方式是先创建一个矩阵并填充其下三角部分,然后将下三角部分复制到上三角以使其对称.第二步是典型的转置操作,对于大型矩阵缓存,值得进行阻塞.只要缓存大小不会太小或太大,性能就应该对其不敏感.通常选择128或256.

The function has two inputs: a "dist" object x and a (integer) cache blocking factor bf. The function works by first creating a matrix and fill in its lower triangular part, then copying the lower triangular part to upper triangular to make it symmetric. The second step is a typical transposition operation and for large matrix cache blocking is worthwhile. Performance should be insensitive to the cache blocking factor as long as it is not too small or too large. 128 or 256 is generally a good choice.

这是我第一次尝试使用Rcpp.我曾经是使用R的常规C接口的C程序员.但是我也要熟悉Rcpp.由于您不知道如何编写已编译的代码,因此您可能也不知道如何运行Rcpp函数.您需要

This is my first try with Rcpp. I have been a C programmer using R's conventional C interface. But I am on my way to get familiar with Rcpp as well. Given that you don't know how to write compiled code, your probably also don't know how to run Rcpp functions. You need to

  1. 安装Rcpp软件包(不确定是否进一步需要 Rtools )如果您使用的是Windows);
  2. 将我的"dist2mat.cpp"复制到R当前工作目录下的文件中(您可以在R会话中从getwd()获取它). ".cpp"文件只是纯文本文件,因此您可以使用任何文本编辑器创建,编辑和保存它.
  1. install Rcpp package (not sure if you further need Rtools if you are on Windows);
  2. copy my "dist2mat.cpp" into a file under your R's current working directory (you can get it from getwd() in your R session). A ".cpp" file is just a plain text file so you can create, edit and save it with any text editor.

现在让我们开始展示.

library(Rcpp)
sourceCpp("dist2mat.cpp")  ## this takes some time; be patient

## a simple test with `dist2mat`
set.seed(0)
x <- dist(matrix(runif(10), nrow = 5, dimnames = list(letters[1:5], NULL)))
A <- dist2mat(x, 128)  ## cache blocking factor = 128
A
#          a         b         c         d         e
#a 0.0000000 0.9401067 0.9095143 0.5618382 0.4275871
#b 0.9401067 0.0000000 0.1162289 0.3884722 0.6968296
#c 0.9095143 0.1162289 0.0000000 0.3476762 0.6220650
#d 0.5618382 0.3884722 0.3476762 0.0000000 0.3368478
#e 0.4275871 0.6968296 0.6220650 0.3368478 0.0000000

结果矩阵保留传递给dist的原始矩阵/数据帧的行名.

The resulting matrix preserves the row names of your original matrix / data frame passed to dist.

您可以调整计算机上的缓存阻止因子.请注意,对于小型矩阵,缓存阻塞的影响并不明显.在这里,我尝试了10000 x 10000.

You can tune the cache blocking factor on your machine. Note that the effects of cache blocking is not evident for small matrices. Here I tried a 10000 x 10000 one.

## mimic a "dist" object without actually calling `dist`
n <- 10000
x <- structure(numeric(n * (n - 1) / 2), class = "dist", Size = n)

system.time(A <- dist2mat(x, 64))
#   user  system elapsed 
#  0.676   0.424   1.113 

system.time(A <- dist2mat(x, 128))
#   user  system elapsed 
#  0.532   0.140   0.672 

system.time(A <- dist2mat(x, 256))
#   user  system elapsed 
#  0.616   0.140   0.759 

我们可以使用as.matrixdist2mat进行基准测试.由于as.matrix占用大量RAM,因此我在此处使用一个小示例.

We can benchmark dist2mat with as.matrix. As as.matrix is RAM consuming, I use a small example here.

## mimic a "dist" object without actually calling `dist`
n <- 2000
x <- structure(numeric(n * (n - 1) / 2), class = "dist", Size = n)

library(bench)
bench::mark(dist2mat(x, 128), as.matrix(x), check = FALSE)
## A tibble: 2 x 14
#  expression         min   mean  median     max `itr/sec` mem_alloc  n_gc n_itr
#  <chr>         <bch:tm> <bch:> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl> <int>
#1 dist2mat(x, …   24.6ms   26ms  25.8ms  37.1ms     38.4     30.5MB     0    20
#2 as.matrix(x)   154.5ms  155ms 154.8ms 154.9ms      6.46   160.3MB     0     4
## ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
##   time <list>, gc <list>

请注意dist2mat如何更快(请参阅平均值",中位数"),以及所需的RAM较少(请参见"mem_alloc").我已将check = FALSE设置为禁用结果检查,因为dist2mat不返回"dimnames"属性("dist"对象没有此类信息),而as.matrix却不返回(将1:2000设置为"dimnames"),因此它们并不完全相等.但是您可以验证它们都是正确的.

Note how dist2mat is faster (see "mean", "median"), and also how less RAM (see "mem_alloc") it needs. I've set check = FALSE to disable result checking, because dist2mat returns no "dimnames" attribute (the "dist" object has no such info) but as.matrix does (it sets 1:2000 as "dimnames"), so they are not exactly equal. But you can verify that they are both correct.

A <- dist2mat(x, 128)
B <- as.matrix(x)
range(A - B)
#[1] 0 0


"dist2mat.cpp"

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix dist2mat(NumericVector& x, int bf) {

  /* input validation */
  if (!x.inherits("dist")) stop("Input must be a 'dist' object");
  int n = x.attr("Size");
  if (n > 65536) stop("R cannot create a square matrix larger than 65536 x 65536");

  /* initialization */
  NumericMatrix A(n, n);

  /* use pointers */
  size_t j, i, jj, ni, nj; double *ptr_x = &x[0];
  double *A_jj, *A_ij, *A_ji, *col, *row, *end;

  /* fill in lower triangular part */
  for (j = 0; j < n; j++) {
    col = &A(j + 1, j); end = &A(n, j);
    while (col < end) *col++ = *ptr_x++;
    }

  /* cache blocking factor */
  size_t b = (size_t)bf;

  /* copy lower triangular to upper triangular; cache blocking applied */
  for (j = 0; j < n; j += b) {
    nj = n - j; if (nj > b) nj = b;
    /* diagonal block has size nj x nj */
    A_jj = &A(j, j);
    for (jj = nj - 1; jj > 0; jj--, A_jj += n + 1) {
      /* copy a column segment to a row segment */
      col = A_jj + 1; row = A_jj + n;
      for (end = col + jj; col < end; col++, row += n) *row = *col;
      }
    /* off-diagonal blocks */
    for (i = j + nj; i < n; i += b) {
      ni = n - i; if (ni > b) ni = b;
      /* off-diagonal block has size ni x nj */
      A_ij = &A(i, j); A_ji = &A(j, i);
      for (jj = 0; jj < nj; jj++) {
        /* copy a column segment to a row segment */
        col = A_ij + jj * n; row = A_ji + jj;
        for (end = col + ni; col < end; col++, row += n) *row = *col;
        }
      }
    }

  /* add row names and column names */
  A.attr("dimnames") = List::create(x.attr("Labels"), x.attr("Labels"));

  return A;
  }

这篇关于距离对象上的as.matrix非常慢;如何使其更快?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆