How to send a dataframe to scikit for cross validation?

Problem description

I want to plot the effect of removing samples (rows). Some people call this a "learning curve".

So I thought of using Pandas to remove some rows. How to remove, randomly, rows from a dataframe but from each label?

But when I try to run cross validation, I get an error (even after using df.values to turn the dataframe into an array).

So, what am I doing wrong?

Here is my code:

import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn import neighbors
from sklearn import cross_validation

df = pd.DataFrame(np.random.rand(12, 5))
label = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label

df1 = pd.concat(g.sample(2) for idx, g in df.groupby('label'))

X = df1[[0, 1, 2, 3, 4]].values
y = df1.label.values
print(X)
print(y)

clf = neighbors.KNeighborsClassifier()
sss = StratifiedShuffleSplit(1, test_size=0.1)
scoresSSS = cross_validation.cross_val_score(clf, X, y, cv=sss)
print(scoresSSS)

Recommended answer

Right off the bat, sss = StratifiedShuffleSplit(n_splits=1, test_size=0.35) creates a splitter object, not an iterable of train/test splits:

>>> type(sss)
    <class 'sklearn.model_selection._split.StratifiedShuffleSplit'>

Instead of passing the entire StratifiedShuffleSplit object to cross_val_score (the object itself isn't iterable, hence the error), you need to pass the train/test output of its .split() method (see the scikit-learn docs).
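To make the difference between the splitter object and its output concrete, here is a minimal sketch. The toy arrays, the test_size of 0.5, and the random_state are assumptions chosen for illustration; they are not taken from the question.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data shaped like the question's subsample: 6 rows, 2 per class (assumed).
X = np.random.rand(6, 5)
y = np.array([1, 1, 2, 2, 3, 3])

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)

# .split() yields (train_indices, test_indices) pairs, one pair per split.
splits = list(sss.split(X, y))
train_idx, test_idx = splits[0]

print(len(splits))          # number of train/test pairs produced
print(train_idx, test_idx)  # integer row positions into X and y
```

The index arrays are positions into X and y, so X[train_idx] and y[train_idx] give the training rows for that split.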

Further, the test_size parameter you pass to StratifiedShuffleSplit is too small. Your subsample has only 6 rows spread over 3 unique classes, so a test size of 0.1 leaves room for a single test sample, which is fewer than the number of classes and raises a ValueError. Lastly, you're using the default n_neighbors value (5) in your KNeighborsClassifier. That default is too large for such a small data set: n_neighbors must be <= the number of training samples, so your code will raise another ValueError. So in my example below I've increased the test size in the StratifiedShuffleSplit object, dropped n_neighbors down to 2, and passed the iterable from sss.split(X, y) to cross_val_score's cv parameter.
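Both failures can be reproduced in isolation. The sketch below assumes the 6-row, 3-class subsample from the question and assumes that scikit-learn rounds a fractional test size up to a whole number of samples; the variable names and the hand-picked training rows are illustrative only.

```python
import math
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

n_samples, n_classes = 6, 3
n_test = math.ceil(0.1 * n_samples)  # assumed rounding: 0.6 becomes 1 test sample
print(n_test, n_test >= n_classes)   # one test sample cannot cover 3 classes

X = np.random.rand(n_samples, 5)
y = np.array([1, 1, 2, 2, 3, 3])

# 1) A test size of 0.1 is rejected when the split is generated.
try:
    next(StratifiedShuffleSplit(n_splits=1, test_size=0.1).split(X, y))
    split_error = None
except ValueError as exc:
    split_error = str(exc)
print(split_error)

# 2) The default n_neighbors=5 exceeds a 3-row training set.
try:
    knn = KNeighborsClassifier()  # default n_neighbors is 5
    knn.fit(X[[0, 2, 4]], y[[0, 2, 4]])
    knn.predict(X[[1, 3, 5]])
    knn_error = None
except ValueError as exc:
    knn_error = str(exc)
print(knn_error)
```

The exact wording of the two messages depends on the installed scikit-learn version, so the sketch prints them rather than matching on their text.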

So here is what you want your code to look like:

import pandas as pd
import numpy as np
from sklearn import neighbors
# sklearn.cross_validation was removed in scikit-learn 0.20;
# cross_val_score now lives in sklearn.model_selection.
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# 12 random rows, 4 per label
df = pd.DataFrame(np.random.rand(12, 5))
label = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label

# Keep 2 random rows from each label
df1 = pd.concat(g.sample(2) for idx, g in df.groupby('label'))

X = df1[[0, 1, 2, 3, 4]].values
y = df1.label.values

# n_neighbors must not exceed the number of training samples
clf = neighbors.KNeighborsClassifier(n_neighbors=2)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.35)

# Pass the train/test splits themselves as the cv argument
scoresSSS = cross_val_score(clf, X, y, cv=sss.split(X, y))
print(scoresSSS)
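As a side note, cross_val_score from sklearn.model_selection also accepts the splitter object directly as cv, so the explicit .split() call is optional there. The following is a sketch under that assumption; the seeded generator, the random_state values, and the variable name scores are additions made only so the run is repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)  # seeded only for repeatability (assumption)
df = pd.DataFrame(rng.random((12, 5)))
df['label'] = np.repeat([1, 2, 3], 4)

# Keep 2 random rows per label, as in the question.
df1 = pd.concat(g.sample(2, random_state=0) for _, g in df.groupby('label'))

X = df1.drop(columns='label').values
y = df1['label'].values

clf = KNeighborsClassifier(n_neighbors=2)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.35, random_state=0)

# The splitter object itself is passed as cv; no .split() call is needed.
scores = cross_val_score(clf, X, y, cv=sss)
print(scores)
```

With only three test rows the single score is very coarse, so it says little about model quality.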

Let me just say that I have no idea what score you're looking to get, and by no means am I claiming that this will optimize your score. However, it will get rid of those errors so you can get back to work.
