R - 如何获得行 &距离矩阵中匹配元素的列下标 [英] R - How to get row & column subscripts of matched elements from a distance matrix

查看:18
本文介绍了R - 如何获得行 &距离矩阵中匹配元素的列下标的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个整数向量 vec1 并且我正在使用 dist 函数生成一个远程矩阵.我想获取距离矩阵中某个值的元素的坐标(行和列).本质上,我想获得相距 d 距离的一对元素.例如:

vec1 <- c(2,3,6,12,17)distMatrix <- dist(vec1)# 1 2 3 4#2 1#3 4 3#4 10 9 6#5 15 14 11 5

比如说,我对向量中相距 5 个单位的一对元素感兴趣.我想得到坐标 1 是行和坐标 2 是距离矩阵的列.在这个玩具示例中,我希望

coord1# [1] 5坐标2# [1] 4

我想知道是否有一种不涉及将 dist 对象转换为矩阵或遍历矩阵的有效方法来获取这些值?

解决方案

距离矩阵是压缩格式的下三角矩阵,其中下三角矩阵按列存储为一维向量.您可以通过

查看

str(distMatrix)# Class 'dist' atomic [1:10] 1 4 10 15 3 9 14 6 11 5# ...

即使我们调用dist(vec1, diag = TRUE, upper = TRUE),向量还是一样的;只有打印样式会发生变化.也就是说,无论你如何调用 dist,你总是得到一个向量.

此答案侧重于如何在 1D 和 2D 索引之间进行转换,以便您可以使用dist"对象,而无需先使用 as.matrix 使其成为完整矩阵.如果您确实想让它成为矩阵,请使用

<小时>

R 函数

为这些索引变换编写矢量化的 R 函数很容易.我们只需要处理越界"索引,应该返回NA.

## 二维索引转一维索引f <- 函数 (i, j, dist_obj) {if (!inherits(dist_obj, "dist")) stop("请提供一个'dist'对象")n <- attr(dist_obj, "Size")有效 <- (i >= 1) &(j >= 1) &(i > j) &(i <= n) &(j <= n)k <- (2 * n - j) * (j - 1)/2 + (i - j)k[!valid] <- NA_real_克}## 一维索引转二维索引finv <- 函数(k,dist_obj){if (!inherits(dist_obj, "dist")) stop("请提供一个'dist'对象")n <- attr(dist_obj, "Size")有效 <- (k >= 1) &(k <= n * (n - 1)/2)k_valid <- k[有效]j <- rep.int(NA_real_, length(k))j[valid] <- floor(((2 * n + 1) - sqrt((2 * n - 1) ^ 2 - 8 * (k_valid - 1)))/2)i <- j + k - (2 * n - j) * (j - 1)/2cbind(i, j)}

这些函数在内存使用方面非常便宜,因为它们使用索引而不是矩阵.

<小时>

finv 应用于您的问题

你可以使用

vec1 <- c(2,3,6,12,17)distMatrix <- dist(vec1)finv(which(distMatrix == 5), distMatrix)#我j#[1,] 5 4

一般来说,距离矩阵包含浮点数.用==判断两个浮点数是否相等是有风险的.阅读为什么这些数字不相等?了解更多和可能的策略.

<小时>

替代 dist2mat

在距离对象上使用 as.matrix 中给出的 dist2mat 函数非常慢;如何让它更快?,我们可以使用which(, arr.ind = TRUE).

库(Rcpp)sourceCpp("dist2mat.cpp")mat <- dist2mat(distMatrix, 128)which(mat == 5, arr.ind = TRUE)# 行列#5 5 4#4 4 5

<小时>

附录:图片的 Markdown(需要 MathJax 支持)

## 二维索引转一维索引下三角看起来像这样: $$egin{pmatrix} 0 &0 &cdots &0\ 	imes &0 &cdots &0\ 	imes &	imes &cdots &0\ vdots &vdots &ddots &0\ 	imes &	imes &cdots &0end{pmatrix}$$ 如果矩阵是$n 	imes n$,那么第一列就有$(n - 1)$个元素("$	imes$"),$(n - j)第 j中的 $ 元素柱子.因此,对于下三角中的元素 $(i, j)$ (with $i > j$, $j th中的$	imes$"柱子.所以它是 $$left{frac{(2n - j)(j - 1)}{2} + (i - j)
ight}^{	extit{th}}$$ "$次$"在下三角形中.----## 一维索引转二维索引现在是 $k$th"$	imes$" 在下三角中,我们如何找到它的矩阵索引$(i, j)$?我们采取两个步骤:1>找到 $j$;2>从 $k$ 和 $j$ 获得 $i$.$j$th的第一个"$	imes$"列,即 $(j + 1, j)$,是 $left{frac{(2n - j)(j - 1)}{2} + 1
ight}^{	extit{th}}$ "$	imes$" ,因此 $j$ 是最大值,使得 $frac{(2n - j)(j - 1)}{2} + 1 leq k$.这相当于找到最大$j$,使得$$j^2 - (2n + 1)j + 2(k + n - 1) geq 0.$$ LHS是二次多项式,很容易看解是不大于第一个根的整数(即左边的根): $$j = leftlfloorfrac{(2n + 1) - sqrt{(2n-1)^2 - 8(k-1)}}{2}
ight
floor.$$ 那么可以从 $$i = j + k - left{frac{(2n - j)(j - 1)}{2}
ight}.$$

I have an integer vector vec1 and I am generating a distant matrix using dist function. I want to get the coordinates (row and column) of element of certain value in the distance matrix. Essentially I would like to get the pair of elements that are d-distant apart. For example:

vec1 <- c(2,3,6,12,17)
distMatrix <- dist(vec1)

#   1  2  3  4
#2  1         
#3  4  3      
#4 10  9  6   
#5 15 14 11  5

Say, I am interested in pair of elements in the vector that are 5 unit apart. I wanted to get the coordinate1 which are the rows and coordinate2 which are the columns of the distance matrix. In this toy example, I would expect

coord1  
# [1] 5
coord2
# [1] 4

I am wondering if there is an efficient way to get these values that doesn't involve converting the dist object to a matrix or looping through the matrix?

解决方案

A distance matrix is a lower triangular matrix in packed format, where the lower triangular is stored as a 1D vector by column. You can check this via

str(distMatrix)
# Class 'dist'  atomic [1:10] 1 4 10 15 3 9 14 6 11 5
# ...

Even if we call dist(vec1, diag = TRUE, upper = TRUE), the vector is still the same; only the printing styles changes. That is, no matter how you call dist, you always get a vector.

This answer focus on how to transform between 1D and 2D index, so that you can work with a "dist" object without first making it a complete matrix using as.matrix. If you do want to make it a matrix, use the dist2mat function defined in as.matrix on a distance object is extremely slow; how to make it faster?.



R functions

It is easy to write vectorized R functions for those index transforms. We only need some care dealing with "out-of-bound" index, for which NA should be returned.

## 2D index to 1D index
f <- function (i, j, dist_obj) {
  if (!inherits(dist_obj, "dist")) stop("please provide a 'dist' object")
  n <- attr(dist_obj, "Size")
  valid <- (i >= 1) & (j >= 1) & (i > j) & (i <= n) & (j <= n)
  k <- (2 * n - j) * (j - 1) / 2 + (i - j)
  k[!valid] <- NA_real_
  k
  }

## 1D index to 2D index
finv <- function (k, dist_obj) {
  if (!inherits(dist_obj, "dist")) stop("please provide a 'dist' object")
  n <- attr(dist_obj, "Size")
  valid <- (k >= 1) & (k <= n * (n - 1) / 2)
  k_valid <- k[valid]
  j <- rep.int(NA_real_, length(k))
  j[valid] <- floor(((2 * n + 1) - sqrt((2 * n - 1) ^ 2 - 8 * (k_valid - 1))) / 2)
  i <- j + k - (2 * n - j) * (j - 1) / 2
  cbind(i, j)
  }

These functions are extremely cheap in memory usage, as they work with index instead of matrices.


Applying finv to your question

You can use

vec1 <- c(2,3,6,12,17)
distMatrix <- dist(vec1)

finv(which(distMatrix == 5), distMatrix)
#     i j
#[1,] 5 4

Generally speaking, a distance matrix contains floating point numbers. It is risky to use == to judge whether two floating point numbers are equal. Read Why are these numbers not equal? for more and possible strategies.


Alternative with dist2mat

Using the dist2mat function given in as.matrix on a distance object is extremely slow; how to make it faster?, we may use which(, arr.ind = TRUE).

library(Rcpp)
sourceCpp("dist2mat.cpp")
mat <- dist2mat(distMatrix, 128)
which(mat == 5, arr.ind = TRUE)
#  row col
#5   5   4
#4   4   5


Appendix: Markdown (needs MathJax support) for the picture

## 2D index to 1D index

The lower triangular looks like this: $$egin{pmatrix} 0 & 0 & cdots & 0\ 	imes & 0 & cdots & 0\ 	imes & 	imes & cdots & 0\ vdots & vdots & ddots & 0\ 	imes & 	imes & cdots & 0end{pmatrix}$$ If the matrix is $n 	imes n$, then there are $(n - 1)$ elements ("$	imes$") in the 1st column, and $(n - j)$ elements in the j<sup>th</sup> column. Thus, for element $(i,  j)$ (with $i > j$, $j < n$) in the lower triangular, there are $$(n - 1) + cdots (n - (j - 1)) = frac{(2n - j)(j - 1)}{2}$$ "$	imes$" in the previous $(j - 1)$ columns, and it is the $(i - j)$<sup>th</sup> "$	imes$" in the $j$<sup>th</sup> column. So it is the $$left{frac{(2n - j)(j - 1)}{2} + (i - j)
ight}^{	extit{th}}$$ "$	imes$" in the lower triangular.

----

## 1D index to 2D index

Now for the $k$<sup>th</sup> "$	imes$" in the lower triangular, how can we find its matrix index $(i, j)$? We take two steps: 1> find $j$;  2> obtain $i$ from $k$ and $j$.

The first "$	imes$" of the $j$<sup>th</sup> column, i.e., $(j + 1, j)$, is the $left{frac{(2n - j)(j - 1)}{2} + 1
ight}^{	extit{th}}$ "$	imes$" of the lower triangular, thus $j$ is the maximum value such that $frac{(2n - j)(j - 1)}{2} + 1 leq k$. This is equivalent to finding the max $j$ so that $$j^2 - (2n + 1)j + 2(k + n - 1) geq 0.$$ The LHS is a quadratic polynomial, and it is easy to see that the solution is the integer no larger than its first root (i.e., the root on the left side): $$j = leftlfloorfrac{(2n + 1) - sqrt{(2n-1)^2 - 8(k-1)}}{2}
ight
floor.$$ Then $i$ can be obtained from $$i = j + k - left{frac{(2n - j)(j - 1)}{2}
ight}.$$

这篇关于R - 如何获得行 &amp;距离矩阵中匹配元素的列下标的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆