How to do feature selection using linear SVM weights
Problem description
I have built an SVM linear model for two classes (1 and 0) using the following code:
class1.svm.model <- svm(Class ~ ., data = training, cost = 1, cross = 10,
                        metric = "ROC", type = "C-classification",
                        kernel = "linear", na.action = na.omit,
                        probability = TRUE)
and I have extracted the weights for the training set using the following code:
# extract the weights and constant from the SVM model:
w <- t(class1.svm.model$coefs) %*% class1.svm.model$SV
b <- -1 * class1.svm.model$rho  # the intercept (sometimes called w0)
I get weights for each feature like the following example:
X2 0.001710949
X3 -0.002717934
X4 -0.001118897
X5 0.009280056
X993 -0.000256577
X1118 0
X1452 0.004280963
X2673 0.002971335
X4013 -0.004369505
Now how do I perform feature selection based on the weights extracted for each feature? How should I build a weight matrix?
I have read papers, but the concept is still not clear to me. Please help!
Recommended answer
I've dashed this answer off rather quickly, so I expect there will be quite a few points that others can expand on, but as something to get you started...
There are a number of ways of doing this, but the first thing to tackle is to convert the linear weights into a measure of how important each feature is to the classification. This is a relatively simple three-step process:
- Normalize the input data so that each feature has mean = 0 and standard deviation = 1.
- Train the model.
- Take the absolute value of each weight. That is, if a weight is -0.57, use 0.57.
Optionally, you can generate a more robust measure of feature importance by repeating the above several times on different sets of training data which you have created by randomly re-sampling your original training data.
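The three steps above can be sketched in R like this. The `feature_importance` helper and the synthetic stand-in data are illustrative additions, not from the original post; the svm() call mirrors the question and assumes the e1071 package:

```r
# Sketch of the normalisation/importance steps, assuming a data.frame
# `training` whose factor column `Class` holds the labels, as in the
# question. The synthetic data below is only a stand-in so the snippet
# runs on its own.

feature_importance <- function(model) {
  # Recover the primal weight vector from the dual form (as in the
  # question's code), then take absolute values: importance = |w|.
  w <- t(model$coefs) %*% model$SV
  abs(drop(w))
}

# Synthetic stand-in data: 50 rows, 4 numeric features, binary class.
set.seed(1)
training <- data.frame(matrix(rnorm(200), ncol = 4))
training$Class <- factor(rep(c(0, 1), length.out = 50))

# Step 1: normalise each feature to mean 0, standard deviation 1.
predictors <- setdiff(names(training), "Class")
training[predictors] <- scale(training[predictors])

# Steps 2-3: train a linear SVM (requires the e1071 package, as in the
# question) and rank the features by |weight|, largest first.
if (requireNamespace("e1071", quietly = TRUE)) {
  model <- e1071::svm(Class ~ ., data = training, cost = 1,
                      type = "C-classification", kernel = "linear")
  importance <- sort(feature_importance(model), decreasing = TRUE)
  print(importance)
}
```

For a more robust ranking, wrap the train-and-score part in a loop over bootstrap resamples of `training` and average the resulting importance vectors.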
Now that you have a way to determine how important each feature is to the classification, you can use this in a number of different ways to select which features to include in your final model. I will give an example of Recursive Feature Elimination, since it is one of my favourites, but you may want to look into iterative feature selection or noise perturbation.
So, to perform recursive feature elimination:
- First train the model on the full feature set and compute the feature importances.
- Discard the feature with the smallest importance value, then retrain the model on the remaining features.
- Repeat step 2 until you are left with a small enough set of features [1].
[1] where a small enough set of features is determined by the point at which the accuracy begins to suffer when you apply your model to a validation set. On which note: when doing this sort of feature selection, make sure that you have not only separate training and test sets, but also a validation set for use in choosing how many features to keep.
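The elimination loop itself can be sketched as follows. This is a minimal illustration, not the original poster's code: it again assumes the e1071 package and a data.frame `training` with a factor column `Class`, and the helper names and the `n_keep` parameter are hypothetical:

```r
# Sketch of recursive feature elimination, ranking features by |w|.

svm_importance <- function(model) {
  # |primal weight| per feature, recovered from the dual form.
  abs(drop(t(model$coefs) %*% model$SV))
}

recursive_elimination <- function(training, n_keep) {
  feats <- setdiff(names(training), "Class")
  while (length(feats) > n_keep) {
    # Retrain on the surviving features (requires the e1071 package).
    model <- e1071::svm(Class ~ ., data = training[c(feats, "Class")],
                        type = "C-classification", kernel = "linear")
    # Drop the single least-important feature, then loop (step 2 above).
    feats <- feats[-which.min(svm_importance(model))]
  }
  feats
}
```

In practice you would not fix `n_keep` in advance: score each intermediate model on the validation set and stop eliminating once accuracy starts to drop, as described in [1].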