How to calculate KNN Variable Importance in R


Problem description

I implemented an authorship attribution project in which I trained a KNN model on articles from two authors. I then classify the author of a new article as either author A or author B. I use the knn() function to generate the model. The output of the model is the table below.

   Word1 Word2 Word3 Author
11     1    48     8      A
2      2     0     0      B
29     1    45     9      A
1      2     0     0      B
4      0     0     0      B
28     3     1     1      B

Looking at this output, it seems clear that Word2 and Word3 are the most significant variables driving the classification between Author A and Author B.

My question is how I can identify this using R.

Recommended answer

Basically, your question boils down to having some variables (Word1, Word2, and Word3 in your example) and a binary outcome (Author in your example) and wanting to know how important each variable is in determining that outcome. A natural approach is to train a model that predicts the outcome from the variables and then check the variable importance in that model. I'll include two approaches here (logistic regression and random forest), but many others could be used.

Let's start with a slightly larger example, in which the outcome depends only on Word2 and Word3, and Word2 has a much larger effect than Word3:

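set.seed(144)
# Three independent standard-normal predictors; Word1 plays no role in the outcome
dat <- data.frame(Word1=rnorm(10000), Word2=rnorm(10000), Word3=rnorm(10000))
# P(Author == "A") follows a logistic curve with coefficient +10 on Word2 and -1 on Word3
dat$Author <- ifelse(runif(10000) < 1/(1+exp(-10*dat$Word2+dat$Word3)), "A", "B")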

We can use the summary of a logistic regression model predicting Author to determine the most important variables:

summary(glm(I(Author=="A")~., data=dat, family="binomial"))
# [snip]
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)    
# (Intercept)  0.05117    0.04935   1.037    0.300    
# Word1       -0.02123    0.04926  -0.431    0.666    
# Word2        9.52679    0.26895  35.422   <2e-16 ***
# Word3       -0.97022    0.05629 -17.236   <2e-16 ***

From the p-values, we can see that Word2 and Word3 both have highly significant effects, while Word1 does not. The signs of the coefficients show that the effect of Word2 is positive and the effect of Word3 is negative, and their magnitudes show that Word2 has a much larger effect on the outcome (the coefficients are directly comparable because, by construction, all the variables are on the same scale).
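Comparing raw coefficients only works here because the simulated predictors share a scale. With real word counts that usually isn't true, so one common workaround is to standardize each predictor before fitting and then rank by absolute coefficient. The following is a minimal sketch of that idea under the same simulated setup; the names `predictors`, `scaled`, `std_coef`, and `ranking` are introduced purely for illustration and are not part of the original answer.

```r
# Recreate the simulated data from the answer
set.seed(144)
n <- 10000
dat <- data.frame(Word1 = rnorm(n), Word2 = rnorm(n), Word3 = rnorm(n))
dat$Author <- ifelse(runif(n) < 1 / (1 + exp(-10 * dat$Word2 + dat$Word3)), "A", "B")

# Assumption: standardizing to mean 0 / sd 1 is an acceptable common scale
predictors <- c("Word1", "Word2", "Word3")
scaled <- dat
scaled[predictors] <- lapply(dat[predictors], function(x) (x - mean(x)) / sd(x))

# Same logistic regression as above, fitted on the standardized predictors
fit <- glm(I(Author == "A") ~ Word1 + Word2 + Word3,
           data = scaled, family = "binomial")

# Rank predictors by absolute standardized coefficient (intercept excluded)
std_coef <- coef(fit)[predictors]
ranking <- sort(abs(std_coef), decreasing = TRUE)
print(ranking)
```

Because the effect of Word2 is so steep, glm() may warn that some fitted probabilities are numerically 0 or 1; the ranking is still usable for this kind of rough comparison.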

Similarly, we can use the variable importance from a random forest predicting the Author outcome:

library(randomForest)
rf <- randomForest(as.factor(Author)~., data=dat)
rf$importance
#       MeanDecreaseGini
# Word1         294.9039
# Word2        4353.2107
# Word3         351.3268

We can identify Word2 as by far the most important variable. This also tells us something else that's interesting: given that we know Word2, Word3 actually isn't much more useful than Word1 in predicting the outcome (and Word1 shouldn't be very useful, because it wasn't used to compute the outcome).
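Neither model above is the KNN classifier from the question. If an importance measure tied to KNN itself is wanted, one option (not part of the original answer) is permutation importance: shuffle one predictor at a time in held-out data and measure how much the KNN accuracy drops. The sketch below assumes knn() from the class package, a 70/30 train/test split, and k = 15, all chosen only for illustration; `knn_accuracy` and `perm_importance` are hypothetical helper names.

```r
library(class)  # provides knn()

# Same data-generating process as the answer, with fewer rows to keep knn() fast
set.seed(144)
n <- 2000
dat <- data.frame(Word1 = rnorm(n), Word2 = rnorm(n), Word3 = rnorm(n))
dat$Author <- factor(ifelse(runif(n) < 1 / (1 + exp(-10 * dat$Word2 + dat$Word3)), "A", "B"))

predictors <- c("Word1", "Word2", "Word3")
train_idx <- sample(n, size = 0.7 * n)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Accuracy of a KNN classifier on the held-out rows, given (possibly shuffled) test predictors
knn_accuracy <- function(test_x) {
  pred <- knn(train[predictors], test_x, cl = train$Author, k = 15)
  mean(pred == test$Author)
}

baseline <- knn_accuracy(test[predictors])

# Permutation importance: shuffle one column at a time and record the accuracy drop
perm_importance <- sapply(predictors, function(v) {
  shuffled <- test[predictors]
  shuffled[[v]] <- sample(shuffled[[v]])
  baseline - knn_accuracy(shuffled)
})

print(round(perm_importance, 3))
```

A larger drop means the classifier relied more on that variable. Since KNN is distance-based, real word-count data would typically need scaling before this comparison is meaningful.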
