Using k-NN in R with categorical values
Question
I'm looking to perform classification on data with mostly categorical features. For that purpose, Euclidean distance (or any other distance that assumes numerical features) doesn't fit.
I'm looking for a kNN implementation for R that lets me choose the distance measure, for example Hamming distance. Is there a way to use a common kNN implementation, such as the one in {class}, with a different distance function?
I'm using R 2.15
Recommended answer
As long as you can calculate a distance/dissimilarity matrix (in whatever way you like) you can easily perform kNN classification without the need of any special package.
# Generate dummy data
y <- rep(1:2, each=50)                          # True class memberships
x <- y %*% t(rep(1, 20)) + rnorm(100*20) < 1.5  # Dataset with 20 variables
design.set <- sample(length(y), 50)             # Indices of the design (training) set
test.set <- setdiff(1:100, design.set)          # Remaining indices form the test set

# Calculate distance and nearest neighbors
library(e1071)
d <- hamming.distance(x)
# apply() returns one column per test point, so transpose the result to get
# one row per test point, with design-set positions ordered nearest first
NN <- t(apply(d[test.set, design.set], 1, order))

# Predict class membership of the test set
k <- 5
pred <- apply(NN[, 1:k, drop=FALSE], 1, function(nn){
    tab <- table(y[design.set][nn])
    as.integer(names(tab)[which.max(tab)])      # This is a pretty dirty line
})

# Inspect the results
table(pred, y[test.set])
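The dummy data above is a logical matrix. For features stored as factors or character columns, a Hamming dissimilarity matrix could be built in base R roughly as follows. This is only a sketch: `hamming_matrix` and the `toy` data frame are hypothetical names introduced for illustration, and missing values are not handled.

```r
# Hypothetical helper: Hamming dissimilarity between the rows of a data frame
# whose columns are all categorical (factor or character).
hamming_matrix <- function(df) {
  m <- as.matrix(df)   # factor/character columns become a character matrix
  n <- nrow(m)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    # t(m) has one column per observation; comparing it with row i recycles
    # that row down each column, so colSums counts the differing features
    d[i, ] <- colSums(t(m) != m[i, ])
  }
  d
}

# Made-up example data, for illustration only
toy <- data.frame(
  colour = c("red", "red", "blue"),
  shape  = c("round", "square", "square"),
  size   = c("small", "small", "large")
)
hamming_matrix(toy)
```

The resulting matrix can be used in place of `d` in the nearest-neighbor step.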
If anybody knows a better way of finding the most common value in a vector than the dirty line above, I'd be happy to know.
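One possible alternative, offered as a sketch rather than a definitive answer, is to count the matches against the unique values, which avoids the round trip through `names()` and `as.integer()`. The helper name `most_common` is hypothetical.

```r
# Hypothetical helper: most frequent value in a vector. Ties go to the value
# that appears first in the vector.
most_common <- function(v) {
  u <- unique(v)
  u[which.max(tabulate(match(v, u)))]
}

most_common(c(2L, 1L, 2L, 2L, 1L))
most_common(c("b", "a", "a", "b"))
```

The result keeps the type of the input, so it also works for character or factor class labels. Note that the tie-breaking differs from the `table()` version, which picks the label that sorts first.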
The drop=FALSE argument is needed to keep the subset of NN a matrix in the case k=1. Without it, the subset is converted to a vector and apply throws an error.
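The effect of drop=FALSE can be seen on a small matrix; the matrix `m` below is made up purely for illustration.

```r
m <- matrix(1:6, nrow = 3)

dim(m[, 1])                 # NULL: the single column was dropped to a vector
dim(m[, 1, drop = FALSE])   # 3 1: still a matrix

# apply() needs an object with dimensions, so the dropped version fails;
# tryCatch() turns the error into a marker value for this demonstration
res <- tryCatch(apply(m[, 1], 1, sum), error = function(e) "error")
res

# With drop = FALSE the same call works
apply(m[, 1, drop = FALSE], 1, sum)
```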