Does a function like dist/rdist exist which handles NAs?

Problem description

I'm using the rdist function from the fields package, but now I want to handle NAs in my matrix the way the dist function does.

Is there such a function?

One solution would be to use dist directly, but my matrix has over 150K rows, so this is not an option.

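Note that removing rows or columns with complete.cases or na.omit is not the solution I'm looking for. The intended behaviour is described in the help page of the dist function:

"Missing values are allowed, and are excluded from all computations involving the rows within which they occur. Further, when Inf values are involved, all pairs of values are excluded when their contribution to the distance gave NaN or NA. If some columns are excluded in calculating a Euclidean, Manhattan, Canberra or Minkowski distance, the sum is scaled up proportionally to the number of columns used. If all pairs are excluded when calculating a particular distance, the value is NA."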

I have added some sample code to illustrate this. Given these vectors:

library(fields)  # provides rdist

vx <- matrix(c(1,2,3), nrow=1)
vy <- matrix(c(2,7,10), nrow=1)
vy.na <- matrix(c(2,NA,10), nrow=1)

dist calculates the distance ignoring the 2nd column and scales the result up to 3 columns, so

dist(rbind(vx,vy))
dist(rbind(vx,vy.na))
rdist(vx,vy)

all return the same value: 8.660254
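As a quick check of the arithmetic behind that value, worked from the three vectors above and the scaling rule quoted from the dist help (the symbols are only shorthand for vx, vy and vy.na):

```latex
\begin{align*}
d(v_x, v_y) &= \sqrt{(1-2)^2 + (2-7)^2 + (3-10)^2}
             = \sqrt{1 + 25 + 49} = \sqrt{75} \approx 8.660254 \\
d(v_x, v_{y,\mathrm{NA}}) &= \sqrt{\tfrac{3}{2}\left[(1-2)^2 + (3-10)^2\right]}
             = \sqrt{\tfrac{3}{2} \cdot 50} = \sqrt{75} \approx 8.660254
\end{align*}
```

The two results coincide in this example only because the dropped squared difference, 25, happens to equal the mean of the two remaining ones (1 and 49). In general the scaled distance is an estimate, not the exact value.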

But

rdist(vx,na.omit(vy.na))

does not return any distance value, because na.omit drops the whole row.

On the other hand, calculating the distances for each pair of vectors individually is much slower than rdist.

My alternative solution is to fill the NAs with a 'neutral' value (such as the median of that column), but I would prefer the dist behaviour.
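For reference, the median-filling fallback could look roughly like the sketch below. The helper name impute_col_median is hypothetical (it is not part of fields or base R), and it assumes a numeric matrix. A column that is entirely NA is left unchanged.

```r
# Hypothetical helper: replace each NA with the median of its column.
impute_col_median <- function(M) {
  M <- as.matrix(M)
  for (j in seq_len(ncol(M))) {
    missing <- is.na(M[, j])
    if (any(missing)) {
      # median over the observed entries of column j only
      M[missing, j] <- median(M[, j], na.rm = TRUE)
    }
  }
  M
}

m <- matrix(c(1, NA, 3,
              4, 5, NA), ncol = 2)
m_filled <- impute_col_median(m)
```

The filled matrix can then be passed to rdist as usual. Unlike the dist behaviour, though, the imputed values do contribute to every distance.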

Recommended answer

After reading @deHaas's answer and his comments, I was able to write an efficient version of rdist that handles NAs the way dist does:

library(pdist)

rdist.w.na <- function(X,Y)
{
  if (!is.matrix(X)) 
    X = as.matrix(X)
  if (!is.matrix(Y)) 
    Y = as.matrix(Y)
  distances <- matrix(pdist(X,Y)@dist, ncol=nrow(X), byrow = TRUE)
  # for each pair of rows, count the columns that are NA in either row
  na.count <- sapply(1:nrow(X),function(i){rowSums(is.na(Y) | is.na(X[i,]))})
  # scale up to the full number of columns, as dist does
  distances * sqrt(ncol(X)/(ncol(X) - na.count))
}

In particular, rdist.w.na(X,X) is equivalent to dist(X), but it returns a full symmetric matrix instead of a lower triangular one.
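If installing pdist is not an option, the same NA handling can be sketched in base R with matrix products. This is an alternative technique, not the code from the answer. The name na_cross_dist is hypothetical, and the sketch assumes numeric inputs with the same number of columns. It zeroes out the NAs, expands the squared difference, and uses 0/1 indicator matrices so that only columns observed in both rows contribute.

```r
# Hypothetical helper: NA-aware Euclidean cross-distances in base R.
# Result has one row per row of X and one column per row of Y.
na_cross_dist <- function(X, Y) {
  X <- as.matrix(X)
  Y <- as.matrix(Y)
  stopifnot(ncol(X) == ncol(Y))
  obs_x <- 1 - is.na(X)          # 1 where X is observed, 0 where NA
  obs_y <- 1 - is.na(Y)
  X0 <- X; X0[is.na(X)] <- 0     # NAs replaced by 0 so they drop out of the sums
  Y0 <- Y; Y0[is.na(Y)] <- 0
  # sum of (x - y)^2 over the columns observed in both rows of each pair
  ss <- (X0^2) %*% t(obs_y) - 2 * X0 %*% t(Y0) + obs_x %*% t(Y0^2)
  ss <- pmax(ss, 0)              # guard against tiny negative rounding errors
  n_used <- obs_x %*% t(obs_y)   # number of columns used for each pair
  d <- sqrt(ss * ncol(X) / n_used)
  d[n_used == 0] <- NA           # no shared columns: distance is undefined
  d
}

vx    <- matrix(c(1, 2, 3), nrow = 1)
vy.na <- matrix(c(2, NA, 10), nrow = 1)
na_cross_dist(vx, vy.na)
```

For the example vectors from the question this should reproduce 8.660254. Note that the intermediate matrices are all nrow(X) by nrow(Y), so for very large inputs the rows would need to be processed in blocks.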
