A discrepancy in computing nearest neighbours between R and Java + WEKA


Problem description

I am debugging a library and another implementation, both of which involve computing k-nearest neighbours. I am framing the question with an example that I have difficulty understanding.

First I will demonstrate the task with a toy example, then show the output that leads to the question.

The demo reads a CSV file containing 10 two-dimensional datapoints. The task is to compute the distance of every datapoint from the first datapoint, and to list the points together with those distances in non-decreasing order of distance.

This is a component of a kNN-based algorithm, and I find a discrepancy between the Java version (a component of a library) and my own version written in R. To demonstrate the discrepancy, consider the following code.

The following code uses Java and WEKA. I have used LinearNNSearch to compute the nearest neighbours, because LinearNNSearch is what the library I am debugging (and comparing with the R code) uses.

import weka.core.converters.CSVLoader;
import weka.core.EuclideanDistance;
import weka.core.Instances;
import weka.core.neighboursearch.LinearNNSearch;
import java.io.File;

class testnn
{
  public static void main (String args[]) throws Exception
  {
    // Load csv
    CSVLoader loader = new CSVLoader ();
    loader.setSource (new File (args[0]));

    Instances df = loader.getDataSet ();

    // Set the LinearNNSearch object
    EuclideanDistance dist_obj = new EuclideanDistance ();

    LinearNNSearch lnn = new LinearNNSearch ();
    lnn.setDistanceFunction(dist_obj);
    lnn.setInstances(df);
    lnn.setMeasurePerformance(false);

    // Compute the K-nearest neighbours of the first datapoint (index 0).
    Instances knn_pts = lnn.kNearestNeighbours (df.instance (0), df.numInstances ());

    // Get the distances.
    double [] dist_arr = lnn.getDistances ();

    // Print
    System.out.println ("Points sorted in increasing order from ");
    System.out.println (df.instance (0));
    System.out.println ("V1,\t" + "V2,\t" + "dist");
    for (int j = 0; j < knn_pts.numInstances (); j++)
    {
      System.out.println (knn_pts.instance (j) + "," + dist_arr[j]);
    }
  }
}

The following is the equivalent R code. To compute the distances I have used dist; using daisy (from the cluster package) gives an identical answer.

# Read file
df <- read.csv ("dat.csv", header = TRUE);

# All-to-all distances, then select the distances of all points from the first datapoint (index 1)
dist_mat <- as.matrix (dist (df, diag=TRUE, upper=TRUE, method="euclidean"));
first_pt_to_all <- dist_mat[,1];

# Sort the distances and also record the ordering
sorted_order <- sort (first_pt_to_all, index.return = TRUE, decreasing = FALSE);

# Build a dataset with the datapoints in non-decreasing order of distance from the first datapoint
# (the [-1] drops the first datapoint itself, whose distance is 0)
df_sorted <- cbind (df[sorted_order$ix[-1],], dist = sorted_order$x[-1]);

# Print
print ("Points sorted in increasing order from ");
print (df[1,]);

print (df_sorted);

Outputs

For easier comparison I am placing the two outputs side by side. Both tables list the points in non-decreasing order of distance.

  • The left-hand table is generated by R; the leftmost column in the R output is the original datapoint index.
  • The right-hand table is generated by Java + WEKA.

     R                                              Java + WEKA
[1] "Points sorted in increasing order from "   Points sorted in increasing order from 
        V1       V2
1 0.560954 0.313231                      0.560954,0.313231
         V1        V2      dist              V1,        V2,     dist
5  0.866816  0.476897 0.3468979          0.866816,0.476897,0.3280721928065624
10 0.262637  0.554558 0.3837079          0.262637,0.554558,0.37871658916675316
4  1.038752  0.396173 0.4849436          1.038752,0.396173,0.43517244797543775
2  0.330345 -0.137681 0.5064604          1.053889,0.486349,0.4795184359817083
7  1.053889  0.486349 0.5224507          1.113799,0.42203,0.506782009966262
6  1.113799  0.422030 0.5634490          0.330345,-0.137681,0.5448256434359463
8  0.416051 -0.338858 0.6679947          0.416051,-0.338858,0.7411841020052856
3  0.870481 -0.302856 0.6894709          0.870481,-0.302856,0.7425541767563134
9  1.386459  0.425101 0.8330507          1.386459,0.425101,0.7451474897289354

Problem

The distances are clearly different, and the ordering of some of the datapoints also differs.

I have plotted the 10 points and numbered them according to their sorted order, indicated by the numerals in the plot.

  • The black text marks the points plotted from the sorted dataset generated by R
  • The red text marks the points plotted from the sorted dataset generated by Java + WEKA

The points ranked 4, 5 and 6 therefore differ. If two datapoints were equidistant from the first datapoint, that would explain the different ordering, but no two points are equidistant from it.
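
The claim that there are no ties can be checked directly. The following is a minimal plain-Java sketch (no WEKA involved) that computes the unnormalized Euclidean distance from the first datapoint to each of the others, using the ten datapoints of this question, and reports the smallest gap between consecutive sorted distances. The class name TieCheck and the helpers rawDistances and smallestGap are made up for illustration.

```java
import java.util.Arrays;

class TieCheck {
    // The ten datapoints from the question (columns V1, V2).
    static final double[][] DATA = {
        {0.560954, 0.313231}, {0.330345, -0.137681}, {0.870481, -0.302856},
        {1.038752, 0.396173}, {0.866816, 0.476897}, {1.113799, 0.42203},
        {1.053889, 0.486349}, {0.416051, -0.338858}, {1.386459, 0.425101},
        {0.262637, 0.554558}
    };

    // Plain (unnormalized) Euclidean distance from row 0 to every other row.
    static double[] rawDistances(double[][] pts) {
        double[] d = new double[pts.length - 1];
        for (int i = 1; i < pts.length; i++) {
            double dx = pts[i][0] - pts[0][0];
            double dy = pts[i][1] - pts[0][1];
            d[i - 1] = Math.sqrt(dx * dx + dy * dy);
        }
        return d;
    }

    // Smallest difference between consecutive values after sorting a copy.
    // A result of 0 would mean that two points are equidistant.
    static double smallestGap(double[] dist) {
        double[] sorted = dist.clone();
        Arrays.sort(sorted);
        double gap = Double.POSITIVE_INFINITY;
        for (int i = 1; i < sorted.length; i++) {
            gap = Math.min(gap, sorted[i] - sorted[i - 1]);
        }
        return gap;
    }

    public static void main(String[] args) {
        double[] d = rawDistances(DATA);
        System.out.println("distances:    " + Arrays.toString(d));
        System.out.println("smallest gap: " + smallestGap(d));
    }
}
```

A strictly positive smallest gap confirms that ties cannot explain the different ordering.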


"V1", "V2"
0.560954,0.313231
0.330345,-0.137681
0.870481,-0.302856
1.038752,0.396173
0.866816,0.476897
1.113799,0.42203
1.053889,0.486349
0.416051,-0.338858
1.386459,0.425101
0.262637,0.554558

Question

  • Why are the distances in the dist columns different, leading to a different ordering of the nearest neighbours?
  • Is there any mistake in the code, or in the way I am using the libraries? Am I using them (especially WEKA) correctly?

Comment if something is unclear or if more information is needed.

Recommended answer

As noted in the comments, the R distances are correct. The problem is the WEKA defaults. You used:

      EuclideanDistance dist_obj = new EuclideanDistance ();
      

Euclidean distance in WEKA has parameters with default values. One of them is DontNormalize=FALSE, i.e. by default WEKA normalizes the data before computing the distance. I am not much help with Java, so I will do this in R. If you scale the data so that each variable has a minimum of zero and a maximum of one, you will get the distances reported by WEKA.

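      # Data is the data frame read from the CSV file (df in the question)
      Data = read.csv("dat.csv", header = TRUE)
      # Rescale each column to [0, 1] (min-max normalization)
      NData = Data
      NData[,1] = (NData[,1]-min(NData[,1]))/(max(NData[,1])-min(NData[,1]))
      NData[,2] = (NData[,2]-min(NData[,2]))/(max(NData[,2])-min(NData[,2]))
      dist(NData)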
      

These distances match what you show for WEKA. To get the same result as R, look into the parameters for EuclideanDistance in WEKA.
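
For readers who prefer Java, the effect of the normalization can be sketched without WEKA. The following plain-Java example is an illustration under the assumption described above (each column rescaled by its min-max range over the whole dataset); it is not WEKA's actual code, and the class name NormalizedDistanceDemo and the method distanceFromFirst are made up. With normalize set to false it should reproduce the R column, and with normalize set to true the Java + WEKA column of the question's table.

```java
class NormalizedDistanceDemo {
    // The ten datapoints from the question (columns V1, V2).
    static final double[][] DATA = {
        {0.560954, 0.313231}, {0.330345, -0.137681}, {0.870481, -0.302856},
        {1.038752, 0.396173}, {0.866816, 0.476897}, {1.113799, 0.42203},
        {1.053889, 0.486349}, {0.416051, -0.338858}, {1.386459, 0.425101},
        {0.262637, 0.554558}
    };

    // Euclidean distance from row 0 to row i.
    // If normalize is true, every column difference is divided by that
    // column's (max - min) over the whole dataset, which is equivalent to
    // rescaling each column to [0, 1] first.
    // Assumes every column has a non-zero range.
    static double distanceFromFirst(double[][] pts, int i, boolean normalize) {
        double sum = 0.0;
        for (int c = 0; c < pts[0].length; c++) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] p : pts) {
                min = Math.min(min, p[c]);
                max = Math.max(max, p[c]);
            }
            double range = normalize ? (max - min) : 1.0;
            double diff = (pts[i][c] - pts[0][c]) / range;
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        System.out.println("V1,V2,raw,normalized");
        for (int i = 1; i < DATA.length; i++) {
            System.out.println(DATA[i][0] + "," + DATA[i][1] + ","
                + distanceFromFirst(DATA, i, false) + ","
                + distanceFromFirst(DATA, i, true));
        }
    }
}
```

In WEKA itself, the switch appears to be the setDontNormalize method that EuclideanDistance inherits from NormalizableDistance. Calling dist_obj.setDontNormalize (true) before passing dist_obj to LinearNNSearch should make the WEKA distances agree with R's, though this has not been run against the library in question.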
