What is the best way to store distances with H2O?
Problem description
Suppose I have two data.frames and I want to calculate the Euclidean distance between all of their rows. My code is:
set.seed(121)
# Load library
library(h2o)
system.time({
  h2o.init()
  # Create the data and convert it to H2O frame format
  df1 <- as.h2o(matrix(rnorm(7500 * 40), ncol = 40))
  df2 <- as.h2o(matrix(rnorm(1250 * 40), ncol = 40))
  # Create a frame to record the distances: one column per row of df2
  matrix1 <- as.h2o(matrix(0, nrow = 7500, ncol = 1250))
  # Loop over the rows of df2 to calculate all the distances
  for (i in 1:nrow(df2)) {
    matrix1[, i] <- h2o.sqrt(h2o.distance(df1, df2[i, ]))
  }
})
I'm sure there is a more efficient way to store it in a matrix.
Recommended answer
You don't need to calculate the distances inside a loop; H2O's distance function can efficiently calculate the distances between all rows at once. For two data frames with dimensions n x k and m x k, you can obtain the n x m distance matrix as follows:
distance_matrix <- h2o.distance(df1, df2, 'l2')
There is no need to take the square root, since the h2o.distance() function lets you specify which distance measure to use: "l1" - absolute distance (L1 norm), "l2" - Euclidean distance (L2 norm), "cosine" - cosine similarity, and "cosine_sq" - squared cosine similarity.
Following your example, the code to calculate the Euclidean distance matrix will be:
library(h2o)
h2o.init()
df1 <- as.h2o(matrix(rnorm(7500 * 40), ncol = 40))
df2 <- as.h2o(matrix(rnorm(1250 * 40), ncol = 40))
distance_matrix <- h2o.distance(df1, df2, 'l2')
This produces a matrix with 7500 rows x 1250 columns.
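As a sanity check on what the "l1" and "l2" measures compute, and on the n x m shape of the result, the following base-R sketch builds the same kind of matrix for small inputs without H2O. The helper name pairwise_dist is hypothetical (it is not part of the h2o package), and the loop is only meant for tiny test cases, not as a replacement for h2o.distance().

```r
# Base-R reference for pairwise row distances on small matrices.
# pairwise_dist is a hypothetical helper, not an h2o function.
pairwise_dist <- function(a, b, measure = c("l2", "l1")) {
  measure <- match.arg(measure)
  stopifnot(ncol(a) == ncol(b))
  # Result has one row per row of a and one column per row of b (n x m)
  out <- matrix(0, nrow = nrow(a), ncol = nrow(b))
  for (j in seq_len(nrow(b))) {
    # Subtract row j of b from every row of a
    diffs <- sweep(a, 2, b[j, ])
    out[, j] <- if (measure == "l2") {
      sqrt(rowSums(diffs^2))   # Euclidean distance (L2 norm)
    } else {
      rowSums(abs(diffs))      # absolute distance (L1 norm)
    }
  }
  out
}

a <- matrix(c(0, 0,
              3, 4), ncol = 2, byrow = TRUE)
b <- matrix(c(0, 0), ncol = 2)

pairwise_dist(a, b, "l2")  # 2 x 1 matrix: 0 and 5
pairwise_dist(a, b, "l1")  # 2 x 1 matrix: 0 and 7
```

On a small subset of the real data, the output of h2o.distance() can be converted with as.matrix() and compared against this reference.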