How to extract info from scikits.learn classifier to then use in C code


Problem description

I have trained a bunch of RBF SVMs using scikits.learn in Python and then Pickled the results. These are for image processing tasks and one thing I want to do for testing is run each classifier on every pixel of some test images. That is, extract the feature vector from a window centered on pixel (i,j), run each classifier on that feature vector, and then move on to the next pixel and repeat. This is far too slow to do with Python.

Clarification: When I say "this is far too slow..." I mean that even the under-the-hood Libsvm code that scikits.learn uses is too slow. I'm actually writing a manual decision function for the GPU so that classification at each pixel happens in parallel.

Is it possible for me to load the classifiers with Pickle, and then grab some kind of attribute that describes how the decision is computed from the feature vector, and then pass that info to my own C code? In the case of linear SVMs, I could just extract the weight vector and bias vector and add those as inputs to a C function. But what is the equivalent thing to do for RBF classifiers, and how do I get that info from the scikits.learn object?
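
For the linear case above, a minimal sketch of what that extraction might look like (the pickle file name is hypothetical, and this assumes a binary classifier):

    import pickle
    import numpy as np

    # Hypothetical file name; assumes the classifier was pickled as described above.
    with open("linear_svm.pkl", "rb") as f:
        clf = pickle.load(f)

    w = np.asarray(clf.coef_).ravel()   # weight vector (binary case)
    b = float(clf.intercept_[0])        # bias term

    def linear_decision(v):
        # Positive value => class 1, otherwise class 0.
        return np.dot(w, v) + b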

Added: First attempts at a solution.

It looks like the classifier object has the attribute support_vectors_, which contains the support vectors as the rows of an array. There is also the attribute dual_coef_, which is a 1 by len(support_vectors_) array of coefficients. From the standard tutorials on non-linear SVMs, it appears that one should do the following (a numpy sketch of these steps follows the list):

  • Compute the feature vector v from your data point under test. This will be a vector that is the same length as the rows of support_vectors_.
  • For each row i in support_vectors_, compute the squared Euclidean distance d[i] between that support vector and v.
  • Compute t[i] as exp{-gamma * d[i]}, where gamma is the RBF parameter.
  • Sum up dual_coef_[i] * t[i] over all i. Add the value of the intercept_ attribute of the scikits.learn classifier to this sum.
  • If the sum is positive, classify as 1. Otherwise, classify as 0.
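
Here is a minimal numpy sketch of those steps, which can be useful for checking the extracted attributes against the library's own decision_function before porting the loop to C (the pickle file name is hypothetical, and gamma is assumed to have been passed as an explicit number when the SVM was fitted):

    import pickle
    import numpy as np

    with open("rbf_svm.pkl", "rb") as f:   # hypothetical file name
        clf = pickle.load(f)

    sv = clf.support_vectors_              # one support vector per row
    alpha = clf.dual_coef_.ravel()         # signed dual coefficients, one per support vector
    b = float(clf.intercept_[0])           # bias term
    gamma = clf.gamma                      # assumes an explicit numeric gamma was used

    def rbf_decision(v):
        d = np.sum((sv - v) ** 2, axis=1)  # squared Euclidean distances to each support vector
        t = np.exp(-gamma * d)             # RBF kernel values
        return np.dot(alpha, t) + b        # positive => class 1

    # Sanity check against the library's own implementation, e.g.:
    # x = some_feature_vector
    # assert np.allclose(rbf_decision(x), clf.decision_function(x.reshape(1, -1)))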

Added: Page 9 of this documentation link mentions that the intercept_ attribute of the classifier does indeed hold the bias term. I have updated the steps above to reflect this.

Recommended answer

Yes, your solution looks alright. To pass the raw memory of a numpy array directly to a C program you can use the ctypes helpers from numpy, or wrap your C program with cython and call it directly by passing the numpy array (see the docs at http://cython.org for more details).
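
As a rough illustration of the ctypes route (the shared library name and the C function signature here are hypothetical; this assumes something like double rbf_decision(const double *v, int n) has been compiled into libsvm_gpu.so):

    import ctypes
    import numpy as np
    from numpy.ctypeslib import load_library, ndpointer

    # Hypothetical shared library exposing: double rbf_decision(const double *v, int n);
    lib = load_library("libsvm_gpu", ".")
    lib.rbf_decision.argtypes = [
        ndpointer(dtype=np.float64, ndim=1, flags="C_CONTIGUOUS"),
        ctypes.c_int,
    ]
    lib.rbf_decision.restype = ctypes.c_double

    v = np.ascontiguousarray(np.random.rand(128))  # stand-in for a real feature vector
    score = lib.rbf_decision(v, v.size)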

However, I am not sure that trying to speed up the prediction on a GPU is the easiest approach: kernel support vector machines are known to be slow at prediction time since their complexity depends directly on the number of support vectors, which can be high for highly non-linear (multi-modal) problems.

Alternative approaches that are faster at prediction time include neural networks (probably more complicated or slower to train right than SVMs, which only have 2 hyper-parameters, C and gamma) or transforming your data with a non-linear transformation based on distances to prototypes + thresholding + max pooling over image areas (only for image classification).

For the second approach, read the recent papers by Adam Coates and have a look at this page on kmeans feature extraction.
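
A very rough sketch of that prototype-distance encoding (this assumes the "triangle" activation described by Coates et al.; the patch size, number of centroids, and random data here are made-up stand-ins):

    import numpy as np
    from sklearn.cluster import KMeans  # assumes a modern sklearn install

    # Learn prototypes (centroids) from randomly sampled image patches.
    patches = np.random.rand(10000, 64)  # stand-in for real 8x8 patches
    centroids = KMeans(n_clusters=100).fit(patches).cluster_centers_

    def encode(patch):
        # Non-linear transform: distances to prototypes + soft thresholding.
        z = np.linalg.norm(centroids - patch, axis=1)
        return np.maximum(0.0, z.mean() - z)

    # Max pooling would then take the element-wise maximum of these codes
    # over all patches falling in each region (e.g. quadrant) of the image.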

Finally, you can also try NuSVC models, whose regularization parameter nu has a direct impact on the number of support vectors in the fitted model: fewer support vectors mean faster prediction times (check the accuracy though; it will be a trade-off between prediction speed and accuracy in the end).
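
For reference, a minimal sketch of fitting a NuSVC (the import path assumes a modern sklearn install; in scikits.learn it lived under scikits.learn.svm, and the nu and gamma values here are just examples):

    import numpy as np
    from sklearn.svm import NuSVC

    X_train = np.random.rand(200, 16)             # stand-in training features
    y_train = (X_train[:, 0] > 0.5).astype(int)   # stand-in labels

    # nu upper-bounds the fraction of margin errors and lower-bounds the fraction
    # of support vectors, so a smaller nu generally means fewer support vectors
    # and faster prediction, at a possible cost in accuracy.
    clf = NuSVC(nu=0.1, kernel="rbf", gamma=0.5).fit(X_train, y_train)
    print(clf.support_vectors_.shape[0])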
