Implementing KNN with different distance metrics using R

Question

I am working on a dataset in order to compare the effect of different distance metrics. I am using the KNN algorithm.

The KNN algorithm in R uses the Euclidean distance by default, so I wrote my own version. I would like to find the number of correct class label matches between the nearest neighbor and the target.

I prepared the data first. Then I called my function on the data (wdbc_n) with K=1, using the Euclidean distance as a test.

library(philentropy)
knn <- function(xmat, k,method){
  n <- nrow(xmat)
  if (n <= k) stop("k can not be more than n-1")
  neigh <- matrix(0, nrow = n, ncol = k)
  for(i in 1:n) {
    ddist<- distance(xmat, method)  
    neigh[i, ] <- order(ddist)[2:(k + 1)]
  }
  return(neigh)
}
wdbc_nn <-knn(wdbc_n ,1,method="euclidean")

I am hoping to get a result similar to the one in the paper "On the Surprising Behavior of Distance Metrics in High Dimensional Space" (https://bib.dbvis.de/uploadedFiles/155.pdf, page 431, table 3).

My question is:

Am I right or wrong with the code?

Any suggestions or references that will guide me will be highly appreciated.

EDIT

The dimensions of my data (breast-cancer-wisconsin, wdbc) are

569  32

After normalizing and removing the id and target columns, the dimensions are

dim(wdbc_n)
569  30

The training and test sets are given by

wdbc_train<-wdbc_n[1:469,]
wdbc_test<-wdbc_n[470:569,]

Answer

Am I right or wrong with the code?

Your code is wrong.

The call to the distance function took about 3 seconds every time on my rather recent PC, so I only ran the first 30 rows for k=3 and noticed that every row of the neigh matrix was identical. Why is that? Take a look at this line:

ddist<- distance(xmat, method)  

Each loop iteration feeds the whole xmat matrix to the distance function, then uses only the first line of the resulting matrix. This calculates the distances between the training set rows, and does so n times, discarding every row except the first, which is not what you want to do. The knn algorithm is supposed to calculate, for each row in the test set, the distance to each row in the training set.
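The identical rows can be reproduced on toy data. The following is only a minimal sketch: it uses base R's dist as a stand-in for philentropy's distance (an assumption made so the example runs without extra packages), and the data are made up. Because the loop body never uses i, every pass computes the same matrix and writes the same values into neigh.

```r
# Hypothetical toy data: 5 points in 2 dimensions
set.seed(1)
xmat <- matrix(runif(10), nrow = 5)
k <- 2
n <- nrow(xmat)
neigh <- matrix(0, nrow = n, ncol = k)

for (i in 1:n) {
  # Same full n x n distance matrix on every pass: i is never used here
  ddist <- as.matrix(dist(xmat))
  # order() treats the matrix as one long vector, so the result is the same each time
  neigh[i, ] <- order(ddist)[2:(k + 1)]
}

# All rows of neigh collapse to a single unique row
print(unique(neigh))
```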

Let's take a look at the documentation for the distance function:

distance(x, method = "euclidean", p = NULL, test.na = TRUE, unit = "log", est.prob = NULL)

x a numeric data.frame or matrix (storing probability vectors) or a numeric data.frame or matrix storing counts (if est.prob is specified).

(...)

in case nrow(x) = 2 : a single distance value. in case nrow(x) > 2 : a distance matrix storing distance values for all pairwise probability vector comparisons.

In your specific case (knn classification), you want to use the 2 row version.

One last thing: you used order, which returns the positions of the values in the ddist vector, sorted from the smallest distance to the largest, rather than the distances themselves. I think what you want is the distances themselves, so you need to use sort instead of order.
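A small illustration of the difference, using a made-up distance vector (the values are hypothetical): order gives positions, sort gives the values.

```r
# Hypothetical distances from one test row to four training rows
ddist <- c(0.9, 0.2, 0.5, 0.1)

pos  <- order(ddist)  # positions, smallest distance first: 4 2 3 1
vals <- sort(ddist)   # the distances themselves: 0.1 0.2 0.5 0.9

print(pos)
print(vals)
```

If the class labels of the neighbors are needed later, the positions from order are what you would use to index the label vector; the values from sort are what you would compare against another package's distance output.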

Based on your code and the example in Lantz (2013) that it seems to be based on, here is a complete working solution. I took the liberty of adding a few lines to make it a standalone program.

Standalone working solution

library(philentropy)
normalize <- function(x) {
 return ((x - min(x)) / (max(x) - min(x)))
}

knn <- function(train, test, k, method){
  n.test <- nrow(test)
  n.train <- nrow(train)
  if (n.train <= k) stop("k can not be more than n.train - 1")
  neigh <- matrix(0, nrow = n.test, ncol = k)
  ddist <- numeric(n.train)
  for(i in 1:n.test) {
    for(j in 1:n.train) {
      # build a 2-row data frame from the current test row and training row
      xmat <- rbind(test[i,], train[j,])
      # with 2 rows, distance() returns a single value; store it in ddist
      ddist[j] <- distance(as.data.frame(xmat), method = method)
    }
    # keep the sorted distances at positions 2..(k + 1), as in the question's code;
    # note that this skips the single closest training row (use 1:k to include it)
    neigh[i, ] <- sort(ddist)[2:(k + 1)]
  }
  return(neigh)
}

wbcd <- read.csv("https://resources.oreilly.com/examples/9781784393908/raw/ac9fe41596dd42fc3877cfa8ed410dd346c43548/Machine%20Learning%20with%20R,%20Second%20Edition_Code/Chapter%2003/wisc_bc_data.csv")
rownames(wbcd) <- wbcd$id
wbcd$id <- NULL
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))

wbcd_train<-wbcd_n[1:469,]
wbcd_test<-wbcd_n[470:569,]
wbcd_nn <-knn(wbcd_train, wbcd_test ,3, method="euclidean")

Do note that this solution might be slow because of the numerous (100 times 469) calls to the distance function. However, since we only feed 2 rows at a time into the distance function, each call is cheap and the overall execution time stays manageable.
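If the per-pair calls turn out to be too slow, one possible alternative is to compute the distances from one test row to all training rows in a single vectorized step. The sketch below does this in base R for the Minkowski family only (p = 1 is Manhattan, p = 2 is Euclidean). The helper name minkowski_to_train and the toy data are hypothetical, and this is not part of philentropy or of the solution above.

```r
# Hypothetical helper: Minkowski distance from one test row to every training row
minkowski_to_train <- function(test_row, train, p = 2) {
  # subtract the test row from each training row, column by column
  diffs <- abs(sweep(as.matrix(train), 2, as.numeric(unlist(test_row))))
  # sum |difference|^p across columns, then take the p-th root
  rowSums(diffs^p)^(1 / p)
}

# Made-up training data: three points in 2 dimensions
train <- matrix(c(0, 0,
                  3, 4,
                  1, 1), ncol = 2, byrow = TRUE)
test_row <- c(0, 0)

d_euclid    <- minkowski_to_train(test_row, train, p = 2)
d_manhattan <- minkowski_to_train(test_row, train, p = 1)
print(d_euclid)
print(d_manhattan)
```

This replaces the inner loop over training rows with one call per test row; other metrics would need their own formula.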

Does it work now?

The first two test rows using the custom knn function:

          [,1]      [,2]      [,3]
[1,] 0.3887346 0.4051762 0.4397497
[2,] 0.2518766 0.2758161 0.2790369

Let us compare with the equivalent function in the FNN package:

library(FNN)
alt.class <- get.knnx(wbcd_train, wbcd_test, k=3, algorithm = "brute")
alt.class$nn.dist

          [,1]      [,2]      [,3]
[1,] 0.3815984 0.3887346 0.4051762
[2,] 0.2392102 0.2518766 0.2758161

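Conclusion: not too shabby. The first two columns of the custom output match the second and third columns of the FNN output; the shift by one comes from sort(ddist)[2:(k + 1)], which skips the closest training row.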
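The question also asks for the number of correct class label matches between the nearest neighbor and the target. A minimal sketch of that last step for k = 1 follows, using made-up data and labels and plain Euclidean distance in base R; all names here are hypothetical. The idea is to keep the position of the nearest training row (not its distance), look up its label, and count agreements with the test labels.

```r
# Made-up data: two well-separated classes in 2 dimensions
train <- matrix(c(0, 0,
                  0, 1,
                  5, 5,
                  6, 5), ncol = 2, byrow = TRUE)
train_labels <- c("B", "B", "M", "M")
test <- matrix(c(0.2, 0.1,
                 5.4, 5.2), ncol = 2, byrow = TRUE)
test_labels <- c("B", "M")

nearest <- integer(nrow(test))
for (i in seq_len(nrow(test))) {
  # Euclidean distance from test row i to every training row
  d <- sqrt(rowSums(sweep(train, 2, test[i, ])^2))
  # position of the closest training row
  nearest[i] <- which.min(d)
}

predicted <- train_labels[nearest]
matches <- sum(predicted == test_labels)
print(matches)
```

Swapping the distance line for another metric leaves the rest unchanged, which is what makes it possible to compare metrics by their match counts.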
