Canberra distance - inconsistent results


Problem description

I'm trying to understand what is going on with my calculation of the Canberra distance. I wrote my own simple canberra.distance function, but its results are not consistent with those of the dist function. I added the option na.rm = T to my function so that the sum can still be computed when a denominator is zero. From ?dist I understand that it uses a similar approach: terms with zero numerator and denominator are omitted from the sum and treated as if the values were missing.

canberra.distance <- function(a, b){
  sum( (abs(a - b)) / (abs(a) + abs(b)), na.rm = T )
}

a <- c(0, 1, 0, 0, 1)
b <- c(1, 0, 1, 0, 1)
canberra.distance(a, b)
> 3 
# the result that I expected
dist(rbind(a, b), method = "canberra")
> 3.75 


a <- c(0, 1, 0, 0)
b <- c(1, 0, 1, 0)
canberra.distance(a, b)
> 3
# the result that I expected
dist(rbind(a, b), method = "canberra")
> 4   

a <- c(0, 1, 0)
b <- c(1, 0, 1)
canberra.distance(a, b)
> 3
dist(rbind(a, b), method = "canberra")
> 3
# now the results are the same

The pairs 0-0 and 1-1 seem to be the problematic ones. In the first case (0-0) both the numerator and the denominator are zero, so the pair should be omitted. In the second case (1-1) the numerator is 0 but the denominator is not, so the term is 0 and the sum should not change.
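To see what each pair contributes, the terms of the first example can be inspected one at a time. This is a small sketch in base R; the names terms and kept are introduced here for illustration only and do not appear in the original code.

```r
a <- c(0, 1, 0, 0, 1)
b <- c(1, 0, 1, 0, 1)

# Per-coordinate terms of the sum, before any NA removal
terms <- abs(a - b) / (abs(a) + abs(b))
print(terms)

# 0/0 evaluates to NaN, which is.na() reports as missing,
# so this is the term that na.rm = TRUE drops
kept <- !is.na(terms)

sum(terms[kept])  # sum over the terms that remain
sum(kept)         # number of terms that contributed
length(terms)     # number of coordinates
```

The 0-0 pair is the only term that gets dropped, and the 1-1 pair contributes 0, so four of the five coordinates take part in the sum.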

What am I missing here?

To be in line with R's definition, the function canberra.distance can be modified as follows:

canberra.distance <- function(a, b){
  sum( abs(a - b) / abs(a + b), na.rm = T )
}

However, the results are the same as before, since abs(a) + abs(b) equals abs(a + b) for the non-negative values used here.

Recommended answer

This might shed some light on the difference. As far as I can see, this is the actual code being run to compute the distance:

static double R_canberra(double *x, int nr, int nc, int i1, int i2)
{
    double dev, dist, sum, diff;
    int count, j;

    count = 0;
    dist = 0;
    for(j = 0 ; j < nc ; j++) {
        if(both_non_NA(x[i1], x[i2])) {
            sum = fabs(x[i1] + x[i2]);
            diff = fabs(x[i1] - x[i2]);
            if (sum > DBL_MIN || diff > DBL_MIN) {
                dev = diff/sum;
                if(!ISNAN(dev) ||
                   (!R_FINITE(diff) && diff == sum &&
                    /* use Inf = lim x -> oo */ (int) (dev = 1.))) {
                    dist += dev;
                    count++;
                }
            }
        }
        i1 += nr;
        i2 += nr;
    }
    if(count == 0) return NA_REAL;
    if(count != nc) dist /= ((double)count/nc);
    return dist;
}

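One line that stands out is

if(!ISNAN(dev) ||
   (!R_FINITE(diff) && diff == sum &&
    /* use Inf = lim x -> oo */ (int) (dev = 1.)))

which handles a special case (an infinite difference is counted as a term of 1) and may not be documented. It only matters for infinite values, though, so it does not change the examples in the question. What does change them is the rescaling near the end of the function, if(count != nc) dist /= ((double)count/nc);, which scales the sum up whenever terms were skipped: 3 / (4/5) = 3.75 for the first example and 3 / (3/4) = 4 for the second. In the third example no term is skipped, so the result stays at 3.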
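The effect of the rescaling step at the end of R_canberra (if(count != nc) dist /= ((double)count/nc);) can be reproduced in R. The following is a rough sketch, not R's own implementation: canberra.rescaled is a hypothetical helper name, it only mirrors the path for finite values, and it uses a comparison with 0 where the C code compares with DBL_MIN.

```r
# Hypothetical helper (not part of R): mimics the finite-value path of
# R_canberra, including the rescaling for omitted terms.
canberra.rescaled <- function(a, b) {
  num <- abs(a - b)
  den <- abs(a + b)
  # Use a pair only if neither value is NA and it is not a 0/0 term
  # (assumption: "> 0" stands in for the "> DBL_MIN" test in the C code)
  used <- !is.na(a) & !is.na(b) & (num > 0 | den > 0)
  if (!any(used)) return(NA_real_)
  total <- sum(num[used] / den[used])
  # Scale up by (number of coordinates) / (number of terms used),
  # which is the same as dividing by count/nc
  total * length(a) / sum(used)
}

canberra.rescaled(c(0, 1, 0, 0, 1), c(1, 0, 1, 0, 1))  # 3.75
canberra.rescaled(c(0, 1, 0, 0),    c(1, 0, 1, 0))     # 4
canberra.rescaled(c(0, 1, 0),       c(1, 0, 1))        # 3
```

With the rescaling in place, the three examples from the question give the same values that dist reported there. Dropping the last line of the function body and returning total instead gives back the plain sum that canberra.distance computes.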
