R-如何取得列&距离矩阵中匹配元素的列下标 [英] R - How to get row & column subscripts of matched elements from a distance matrix
问题描述
我有一个整数矢量vec1
,并且正在使用dist
函数生成一个远距矩阵.我想在距离矩阵中获取某个值的元素的坐标(行和列).从本质上讲,我想将这对元素分开d距离.例如:
I have an integer vector vec1
and I am generating a distant matrix using dist
function. I want to get the coordinates (row and column) of element of certain value in the distance matrix. Essentially I would like to get the pair of elements that are d-distant apart. For example:
vec1 <- c(2,3,6,12,17)
distMatrix <- dist(vec1)
# 1 2 3 4
#2 1
#3 4 3
#4 10 9 6
#5 15 14 11 5
说,我对向量中相距5个单位的一对元素感兴趣.我想获得分别是距离矩阵的行和坐标2的列的coordinate1.在这个玩具示例中,我希望
Say, I am interested in pair of elements in the vector that are 5 unit apart. I wanted to get the coordinate1 which are the rows and coordinate2 which are the columns of the distance matrix. In this toy example, I would expect
coord1
# [1] 5
coord2
# [1] 4
我想知道是否有一种有效的方法来获取这些值,而不涉及将dist
对象转换为矩阵或遍历矩阵?
I am wondering if there is an efficient way to get these values that doesn't involve converting the dist
object to a matrix or looping through the matrix?
推荐答案
距离矩阵是压缩格式的下三角矩阵,其中下三角按列存储为一维矢量.您可以通过
A distance matrix is a lower triangular matrix in packed format, where the lower triangular is stored as a 1D vector by column. You can check this via
str(distMatrix)
# Class 'dist' atomic [1:10] 1 4 10 15 3 9 14 6 11 5
# ...
即使我们调用dist(vec1, diag = TRUE, upper = TRUE)
,向量也仍然相同;仅打印样式会更改.也就是说,无论您如何调用dist
,都始终会得到一个向量.
Even if we call dist(vec1, diag = TRUE, upper = TRUE)
, the vector is still the same; only the printing styles changes. That is, no matter how you call dist
, you always get a vector.
此答案集中于如何在1D和2D索引之间进行转换,因此您可以使用"dist"对象,而无需先使用as.matrix
使其成为完整的矩阵.如果确实要使其成为矩阵,则对距离对象使用 as.matrix中定义的dist2mat
函数非常慢;如何使其更快?.
This answer focus on how to transform between 1D and 2D index, so that you can work with a "dist" object without first making it a complete matrix using as.matrix
. If you do want to make it a matrix, use the dist2mat
function defined in as.matrix on a distance object is extremely slow; how to make it faster?.
为这些索引转换编写向量化的R函数很容易.我们只需要处理越界"索引,应该为其返回NA
.
It is easy to write vectorized R functions for those index transforms. We only need some care dealing with "out-of-bound" index, for which NA
should be returned.
## 2D index to 1D index
f <- function (i, j, dist_obj) {
if (!inherits(dist_obj, "dist")) stop("please provide a 'dist' object")
n <- attr(dist_obj, "Size")
valid <- (i >= 1) & (j >= 1) & (i > j) & (i <= n) & (j <= n)
k <- (2 * n - j) * (j - 1) / 2 + (i - j)
k[!valid] <- NA_real_
k
}
## 1D index to 2D index
finv <- function (k, dist_obj) {
if (!inherits(dist_obj, "dist")) stop("please provide a 'dist' object")
n <- attr(dist_obj, "Size")
valid <- (k >= 1) & (k <= n * (n - 1) / 2)
k_valid <- k[valid]
j <- rep.int(NA_real_, length(k))
j[valid] <- floor(((2 * n + 1) - sqrt((2 * n - 1) ^ 2 - 8 * (k_valid - 1))) / 2)
i <- j + k - (2 * n - j) * (j - 1) / 2
cbind(i, j)
}
这些函数在内存使用上非常便宜,因为它们使用索引而不是矩阵.
These functions are extremely cheap in memory usage, as they work with index instead of matrices.
您可以使用
vec1 <- c(2,3,6,12,17)
distMatrix <- dist(vec1)
finv(which(distMatrix == 5), distMatrix)
# i j
#[1,] 5 4
通常,距离矩阵包含浮点数.使用==
判断两个浮点数是否相等是有风险的.阅读为什么这些数字不相等?以了解更多可行的策略.
Generally speaking, a distance matrix contains floating point numbers. It is risky to use ==
to judge whether two floating point numbers are equal. Read Why are these numbers not equal? for more and possible strategies.
在距离对象上使用 as.matrix中提供的dist2mat
函数非常慢;如何使其更快?,我们可能会使用which(, arr.ind = TRUE)
.
Using the dist2mat
function given in as.matrix on a distance object is extremely slow; how to make it faster?, we may use which(, arr.ind = TRUE)
.
library(Rcpp)
sourceCpp("dist2mat.cpp")
mat <- dist2mat(distMatrix, 128)
which(mat == 5, arr.ind = TRUE)
# row col
#5 5 4
#4 4 5
附录:图片的Markdown(需要MathJax支持)
## 2D index to 1D index
The lower triangular looks like this: $$\begin{pmatrix} 0 & 0 & \cdots & 0\\ \times & 0 & \cdots & 0\\ \times & \times & \cdots & 0\\ \vdots & \vdots & \ddots & 0\\ \times & \times & \cdots & 0\end{pmatrix}$$ If the matrix is $n \times n$, then there are $(n - 1)$ elements ("$\times$") in the 1st column, and $(n - j)$ elements in the j<sup>th</sup> column. Thus, for element $(i,\ j)$ (with $i > j$, $j < n$) in the lower triangular, there are $$(n - 1) + \cdots (n - (j - 1)) = \frac{(2n - j)(j - 1)}{2}$$ "$\times$" in the previous $(j - 1)$ columns, and it is the $(i - j)$<sup>th</sup> "$\times$" in the $j$<sup>th</sup> column. So it is the $$\left\{\frac{(2n - j)(j - 1)}{2} + (i - j)\right\}^{\textit{th}}$$ "$\times$" in the lower triangular.
----
## 1D index to 2D index
Now for the $k$<sup>th</sup> "$\times$" in the lower triangular, how can we find its matrix index $(i,\ j)$? We take two steps: 1> find $j$; 2> obtain $i$ from $k$ and $j$.
The first "$\times$" of the $j$<sup>th</sup> column, i.e., $(j + 1,\ j)$, is the $\left\{\frac{(2n - j)(j - 1)}{2} + 1\right\}^{\textit{th}}$ "$\times$" of the lower triangular, thus $j$ is the maximum value such that $\frac{(2n - j)(j - 1)}{2} + 1 \leq k$. This is equivalent to finding the max $j$ so that $$j^2 - (2n + 1)j + 2(k + n - 1) \geq 0.$$ The LHS is a quadratic polynomial, and it is easy to see that the solution is the integer no larger than its first root (i.e., the root on the left side): $$j = \left\lfloor\frac{(2n + 1) - \sqrt{(2n-1)^2 - 8(k-1)}}{2}\right\rfloor.$$ Then $i$ can be obtained from $$i = j + k - \left\{\frac{(2n - j)(j - 1)}{2}\right\}.$$
这篇关于R-如何取得列&距离矩阵中匹配元素的列下标的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!