bad result when using precomputed chi2 kernel with libsvm (matlab)


Problem description

I am trying libsvm and I am following the example for training an SVM on the heart_scale data that comes with the software. I want to use a chi2 kernel which I precompute myself. The classification rate on the training data drops to 24%. I am sure I compute the kernel correctly, but I guess I must be doing something wrong. The code is below. Can you see any mistakes? Help would be greatly appreciated.

%read in the data:
[heart_scale_label, heart_scale_inst] = libsvmread('heart_scale');
train_data = heart_scale_inst(1:150,:);
train_label = heart_scale_label(1:150,:);
test_data = heart_scale_inst(151:270,:);  %remaining rows, needed for ttest below

%read somewhere that the kernel should not be sparse
ttrain = full(train_data)';
ttest = full(test_data)';

precKernel = chi2_custom(ttrain', ttrain');
model_precomputed = svmtrain2(train_label, [(1:150)', precKernel], '-t 4');

This is how the kernel is precomputed:

function res=chi2_custom(x,y)
a=size(x);
b=size(y);
res = zeros(a(1,1), b(1,1));
for i=1:a(1,1)
    for j=1:b(1,1)
        resHelper = chi2_ireneHelper(x(i,:), y(j,:));
        res(i,j) = resHelper;
    end
end
function resHelper = chi2_ireneHelper(x,y)
a=(x-y).^2;
b=(x+y);
resHelper = sum(a./(b + eps));

With a different SVM implementation (vlfeat) I obtain a classification rate of around 90% on the training data (yes, I tested on the training data, just to see what is going on). So I am pretty sure the libsvm result is wrong.

Answer

When working with support vector machines, it is very important to normalize the dataset as a pre-processing step. Normalization puts the attributes on the same scale and prevents attributes with large values from biasing the result. It also improves numerical stability (minimizes the likelihood of overflows and underflows due to floating-point representation).
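As a quick illustration of that pre-processing step, min-max scaling to the [0,1] range can be done column-wise with `bsxfun` (the feature matrix here is a made-up toy, not the heart_scale data):

```matlab
% Toy feature matrix: rows = samples, columns = attributes (values invented)
X = [1 200; 3 400; 2 100];

% Column-wise min-max scaling to [0,1]
mn = min(X,[],1);
mx = max(X,[],1);
Xs = bsxfun(@rdivide, bsxfun(@minus, X, mn), mx - mn);
% Every column of Xs now spans exactly [0,1]
```

This is the same scheme applied to the full dataset in the complete example further down.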

Also, to be exact, your calculation of the chi-squared kernel is slightly off. Instead, take the definition below and use this faster implementation of it:

    k(x, y) = 1 - sum_i (x_i - y_i)^2 / ( (x_i + y_i)/2 )

function D = chi2Kernel(X,Y)
    D = zeros(size(X,1),size(Y,1));
    for i=1:size(Y,1)
        d = bsxfun(@minus, X, Y(i,:));
        s = bsxfun(@plus, X, Y(i,:));
        D(:,i) = sum(d.^2 ./ (s/2+eps), 2);
    end
    D = 1 - D;
end
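To see how this differs from `chi2_custom` above: the corrected version halves the denominator (doubling each term) and turns the resulting distance into a similarity via `1 - D`. A small sketch on made-up vectors (assumed already scaled to [0,1]):

```matlab
% Toy vectors, values invented for illustration
x = [0.2 0.5 0.3];
y = [0.1 0.6 0.3];

d_old = sum((x - y).^2 ./ (x + y + eps));      % asker's version: a raw distance
d_new = sum((x - y).^2 ./ ((x + y)/2 + eps));  % corrected distance term (about 2*d_old)
k     = 1 - d_new;                             % similarity value fed to libsvm as the kernel
```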

Now consider the following example using the same dataset as you (code adapted from a previous answer of mine):

%# read dataset
[label,data] = libsvmread('./heart_scale');
data = full(data);      %# sparse to full

%# normalize data to [0,1] range
mn = min(data,[],1); mx = max(data,[],1);
data = bsxfun(@rdivide, bsxfun(@minus, data, mn), mx-mn);

%# split into train/test datasets
trainData = data(1:150,:);    testData = data(151:270,:);
trainLabel = label(1:150,:);  testLabel = label(151:270,:);
numTrain = size(trainData,1); numTest = size(testData,1);

%# compute kernel matrices between every pairs of (train,train) and
%# (test,train) instances and include sample serial number as first column
K =  [ (1:numTrain)' , chi2Kernel(trainData,trainData) ];
KK = [ (1:numTest)'  , chi2Kernel(testData,trainData)  ];

%# view 'train vs. train' kernel matrix
figure, imagesc(K(:,2:end))
colormap(pink), colorbar

%# train model
model = svmtrain(trainLabel, K, '-t 4');

%# test on testing data
[predTestLabel, acc, decVals] = svmpredict(testLabel, KK, model);
cmTest = confusionmat(testLabel,predTestLabel)

%# test on training data
[predTrainLabel, acc, decVals] = svmpredict(trainLabel, K, model);
cmTrain = confusionmat(trainLabel,predTrainLabel)

The results on the testing data:

Accuracy = 84.1667% (101/120) (classification)
cmTest =
    62     8
    11    39

and on the training data, we get around 90% accuracy as you expected:

Accuracy = 92.6667% (139/150) (classification)
cmTrain =
    77     3
     8    62

