Datasets to test Nonlinear SVM


Question


I'm implementing a nonlinear SVM and I want to test my implementation on a simple data set that is not linearly separable. Google didn't help me find what I want. Can you please advise me where I can find such data? Or at least, how can I generate such data manually?

Thanks,

Answer

Well, SVMs are two-class classifiers--i.e., these classifiers place data on either side of a single decision boundary.

Therefore, I would suggest a data set comprised of just two classes. That's not strictly necessary, because of course an SVM can separate more than two classes by passing the classifier multiple times (in series) over the data, but it's cumbersome to do this during initial testing.
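To make the "multiple passes" remark concrete, here is a minimal sketch of one common way to do it (one-vs-rest relabelling), assuming that is roughly what is meant. The function names and the toy labels are hypothetical, chosen only for illustration; each pass turns the multi-class labels into a +1/-1 problem that a two-class SVM can handle.

```python
def one_vs_rest_labels(labels, positive_class):
    """Relabel a multi-class label list as +1 (positive_class) or -1 (all other classes)."""
    return [1 if label == positive_class else -1 for label in labels]


def binary_problems(labels):
    """Yield (class, relabelled list) for each distinct class: one binary problem per pass."""
    for cls in sorted(set(labels)):
        yield cls, one_vs_rest_labels(labels, cls)


if __name__ == "__main__":
    toy_labels = ["I", "II", "III", "II", "I"]  # hypothetical labels, for illustration only
    for cls, binary in binary_problems(toy_labels):
        print(cls, binary)
```

Each relabelled list would be fed to a separate training run, which is why starting with a plain two-class data set is less work.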

So for instance, you can use the iris data set, linked to in Scott's answer. It's comprised of three classes: Class I is linearly separable from Classes II and III, while Classes II and III are not linearly separable from each other. If you want to use this data set, for convenience you might prefer to remove Class I (approx. the first 50 data rows), so that what remains is a two-class system in which the two remaining classes are not linearly separable.
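Dropping one class and keeping the other two could look something like the sketch below. It assumes the data is already loaded as a list of feature rows plus a parallel list of class labels; `two_class_subset` is a hypothetical helper, and the rows in the demo are placeholders, not real iris measurements.

```python
def two_class_subset(rows, labels, drop_class):
    """Remove every row of drop_class and map the two remaining classes to -1 / +1."""
    remaining = sorted(set(labels) - {drop_class})
    if len(remaining) != 2:
        raise ValueError("expected exactly two classes to remain")
    to_binary = {remaining[0]: -1, remaining[1]: 1}
    kept_rows, kept_labels = [], []
    for row, label in zip(rows, labels):
        if label != drop_class:
            kept_rows.append(row)
            kept_labels.append(to_binary[label])
    return kept_rows, kept_labels


if __name__ == "__main__":
    # Placeholder rows and labels (NOT real iris values), just to show the call.
    rows = [[0.1, 0.2], [1.0, 1.1], [1.2, 0.9], [0.2, 0.1]]
    labels = ["I", "II", "III", "I"]
    X, y = two_class_subset(rows, labels, drop_class="I")
    print(X, y)
```

Filtering by label rather than slicing off the first 50 rows avoids relying on the row order of whichever copy of the file you downloaded.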

The iris data set is quite small (150 x 4, or 50 rows/class x four features)--depending on where you are with your SVM prototype testing, this might be exactly what you want, or you might want a larger data set.
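If you need a data set of arbitrary size, or want to generate one by hand as the question asks, one simple option is two concentric rings: one class sits on an inner circle and the other on an outer circle, so no straight line can split them as long as the noise is small relative to the gap between the radii. The sketch below is only an illustration; `make_rings` and all of its parameters are hypothetical names, not part of any library.

```python
import math
import random


def make_rings(n_per_class, inner_radius=1.0, outer_radius=3.0, noise=0.2, seed=0):
    """Return (points, labels): label -1 on the inner ring, +1 on the outer ring."""
    rng = random.Random(seed)  # seeded so the data set is reproducible
    points, labels = [], []
    for radius, label in ((inner_radius, -1), (outer_radius, 1)):
        for _ in range(n_per_class):
            angle = rng.uniform(0.0, 2.0 * math.pi)
            r = radius + rng.gauss(0.0, noise)  # radial jitter around the ring
            points.append((r * math.cos(angle), r * math.sin(angle)))
            labels.append(label)
    return points, labels


if __name__ == "__main__":
    points, labels = make_rings(n_per_class=100)
    print(len(points), labels.count(-1), labels.count(1))
```

An RBF or polynomial kernel should be able to separate such rings, whereas a linear kernel should not, which makes this a convenient sanity check for a nonlinear implementation.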

An interesting family of data sets that are comprised of just two classes and that are definitely not linearly separable are the anonymized data sets supplied by the mega-dating site eHarmony (no affiliation of any kind). In addition to the iris data, I like to use these data sets for SVM prototype evaluation because they are large data sets with quite a few features, yet they are still comprised of just two classes that are not linearly separable.

I am aware of two places from which you can retrieve this data. The first site has a single data set (PCI Code downloads, chapter9, matchmaker.csv) comprised of 500 data points (rows) and six features (columns). Although this set is simpler to work with, the data is more or less in a 'raw' form and will require some processing before you can use it.

The second source for this data contains two eHarmony data sets, one of which is comprised of over half a million rows and 59 features. In addition, these two data sets have undergone substantial processing, such that the only task required before feeding them to your SVM is routine rescaling of the features.
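"Routine rescaling" can mean several things; the sketch below assumes simple min-max scaling of each feature column to the range [0, 1], which is one common choice (standardizing to zero mean and unit variance is another). `min_max_scale` is a hypothetical helper written for illustration.

```python
def min_max_scale(rows):
    """Scale each feature column to [0, 1]; constant columns are mapped to 0.0."""
    columns = list(zip(*rows))  # transpose: one tuple per feature
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


if __name__ == "__main__":
    data = [[1.0, 200.0], [3.0, 400.0], [2.0, 300.0]]  # toy values for illustration
    print(min_max_scale(data))
```

In a real train/test split you would compute the per-column minimum and maximum on the training rows only and reuse them for the test rows.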
