Datasets to test Nonlinear SVM


Question

I'm implementing a nonlinear SVM and I want to test my implementation on a simple data set that is not linearly separable. Google didn't help me find what I want. Can you please advise me where I can find such data? Or at least, how can I generate such data manually?

Thanks,

Recommended answer

Well, SVMs are two-class classifiers; i.e., they place data on either side of a single decision boundary.

Therefore, I would suggest a data set made up of just two classes. That's not strictly necessary, because of course an SVM can separate more than two classes by passing the classifier multiple times (in series) over the data, but it's cumbersome to do this during initial testing.
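One common way to read "passing the classifier multiple times" is one-vs-rest: each pass treats one class as positive and everything else as negative. The sketch below only shows the relabeling step, under that assumption; the function name `one_vs_rest_labels` and the class names are hypothetical, and the answer does not say which multi-class scheme it has in mind.

```python
def one_vs_rest_labels(y, positive_class):
    """Relabel a multi-class target as +1 for one class and -1 for all others."""
    return [1 if label == positive_class else -1 for label in y]


# Hypothetical three-class target; each class gets its own binary problem.
y = ["I", "II", "III", "II", "I"]
binary_problems = {c: one_vs_rest_labels(y, c) for c in sorted(set(y))}

for c, labels in binary_problems.items():
    print(c, labels)
```

Each entry in `binary_problems` would then be fed to a separate two-class SVM, which is the extra bookkeeping that makes this cumbersome for a first test.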

So for instance, you can use the iris data set, linked to in Scott's answer. It consists of three classes: Class I is linearly separable from Classes II and III, while Classes II and III are not linearly separable from each other. If you want to use this data set, for convenience's sake you might prefer to remove Class I (approx. the first 50 data rows), so that what remains is a two-class system in which the two remaining classes are not linearly separable.
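A minimal sketch of that filtering step, assuming the data is in the UCI `iris.data` layout (four measurements followed by the species name). It drops rows by label rather than by position, so it does not depend on row order. The inline sample, the function name `load_two_class_iris`, and the -1/+1 encoding are assumptions for illustration.

```python
import csv
import io

# A few rows in the assumed UCI layout; in practice, open the full 150-row file.
SAMPLE = """\
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""

# Assumed label encoding for a two-class SVM.
LABELS = {"Iris-versicolor": -1, "Iris-virginica": 1}


def load_two_class_iris(lines):
    """Return (X, y) keeping only the two classes listed in LABELS."""
    X, y = [], []
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        *features, species = row
        if species not in LABELS:  # drops the linearly separable class (setosa)
            continue
        X.append([float(v) for v in features])
        y.append(LABELS[species])
    return X, y


X, y = load_two_class_iris(io.StringIO(SAMPLE))
print(len(X), y)
```

To use the real file, pass an open file object instead of the `io.StringIO` sample.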

The iris data set is quite small (150 x 4, or 50 rows/class x four features). Depending on where you are with your SVM prototype testing, this might be exactly what you want, or you might want a larger data set.
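When size is the concern, a synthetic set can be generated at any size, which is also what the question asks about. The following is a minimal sketch and not part of the original answer: it assumes a two-ring layout (one class surrounding the other), which no straight line can separate. The function name `make_rings` and all its parameters are hypothetical.

```python
import math
import random


def make_rings(n_per_class, inner_radius=1.0, outer_radius=3.0, noise=0.2, seed=0):
    """Generate two noisy concentric rings in 2-D, labeled -1 (inner) and +1 (outer)."""
    rng = random.Random(seed)  # seeded so repeated test runs see the same data
    X, y = [], []
    for radius, label in ((inner_radius, -1), (outer_radius, 1)):
        for _ in range(n_per_class):
            angle = rng.uniform(0.0, 2.0 * math.pi)
            r = radius + rng.gauss(0.0, noise)  # jitter the radius
            X.append([r * math.cos(angle), r * math.sin(angle)])
            y.append(label)
    return X, y


X, y = make_rings(100)
print(len(X), y.count(-1), y.count(1))
```

A linear kernel should do poorly on this data while an RBF kernel should separate it, which makes it a convenient sanity check for a nonlinear implementation.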

An interesting family of data sets that consist of just two classes and are definitely not linearly separable are the anonymized data sets supplied by the mega-dating site eHarmony (no affiliation of any kind). In addition to the iris data, I like to use these data sets for SVM prototype evaluation because they are large data sets with quite a few features, yet they still consist of just two classes that are not linearly separable.

I am aware of two places from which you can retrieve this data. The first site has a single data set (PCI Code downloads, chapter9, matchmaker.csv) consisting of 500 data points (rows) and six features (columns). Although this set is simpler to work with, the data is more or less in a 'raw' form and will require some processing before you can use it.

The second source for this data contains two eHarmony data sets, one of which consists of over half a million rows and 59 features. In addition, these two data sets have undergone substantial processing, so the only task required before feeding them to your SVM is routine rescaling of the features.
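The answer does not say which rescaling it means. One common choice is standardization to zero mean and unit variance per feature, sketched here in plain Python. The names `fit_scaler` and `transform` and the toy rows are assumptions, not part of either data set.

```python
import math


def fit_scaler(X):
    """Compute per-feature mean and (population) standard deviation."""
    n = len(X)
    n_features = len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((row[j] - means[j]) ** 2 for row in X) / n
        # A constant column has zero variance; fall back to 1.0 to avoid dividing by zero.
        stds.append(math.sqrt(var) or 1.0)
    return means, stds


def transform(X, means, stds):
    """Apply previously fitted statistics to any set of rows."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


# Toy rows with features on very different scales (hypothetical values).
X_train = [[1.0, 200.0, 5.0], [3.0, 400.0, 5.0]]
means, stds = fit_scaler(X_train)
X_scaled = transform(X_train, means, stds)
print(X_scaled)
```

Fitting the statistics on the training rows and reusing them for the test rows keeps the two on the same scale, which matters for kernels that depend on distances between points.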
