How to send a dataframe to scikit for cross validation?
Question
I want to plot the effect of removing samples (rows). Some people call it a "learning curve".
So I thought of using Pandas to remove some rows (see "How to remove, randomly, rows from a dataframe but from each label?").
But when I try to do cross validation, I get the following error (even after using df.values to turn the dataframe into an array):
So, what am I doing wrong?
Here is my code:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn import neighbors
from sklearn import cross_validation
df = pd.DataFrame(np.random.rand(12, 5))
label = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label
df1 = pd.concat(g.sample(2) for idx, g in df.groupby('label'))
X = df1[[0, 1, 2, 3, 4]].values
y = df1.label.values
print(X)
print(y)
clf = neighbors.KNeighborsClassifier()
sss = StratifiedShuffleSplit(1, test_size=0.1)
scoresSSS = cross_validation.cross_val_score(clf, X, y, cv=sss)
print(scoresSSS)
Recommended answer
Right off the bat, sss = StratifiedShuffleSplit(n_splits=1, test_size=0.35) creates a splitter object, not an iterable of train/test splits:
>>> type(sss)
<class 'sklearn.model_selection._split.StratifiedShuffleSplit'>
Instead of passing the entire StratifiedShuffleSplit object as cv (the old cross_validation.cross_val_score tries to iterate over it, and the object isn't iterable, hence the error), you need to pass the train/test splits produced by the object's .split() method.
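To make the difference concrete, here is a minimal sketch of what .split() produces. The toy X and y arrays, the test_size of 0.5, and the random_state are my own assumptions for illustration, not values from the question.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (hypothetical): 6 rows, 2 rows per label
X = np.arange(12).reshape(6, 2)
y = np.array([1, 1, 2, 2, 3, 3])

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)

# .split() yields one (train_indices, test_indices) pair per split
splits = list(sss.split(X, y))
train_idx, test_idx = splits[0]

print(len(splits))   # number of splits generated
print(train_idx)     # row positions used for training
print(test_idx)      # row positions used for testing
```

Each pair holds integer row positions into X and y, which is the form an iterable cv argument is expected to provide.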
Further, the test_size param of your StratifiedShuffleSplit is too small. With test_size=0.1 you get a ValueError: you have 3 unique classes, and 10% of your 6 sampled rows rounds up to a single test row, which cannot hold one row per class. Lastly, you're using the default n_neighbors value in your KNeighborsClassifier clf. That default (5) is too large for such a small data set, and it will throw another ValueError because scikit-learn requires n_neighbors <= n_samples (the number of training rows). So in my example below I've upped the test size in your StratifiedShuffleSplit object, dropped n_neighbors down to 2, and passed the iterable from sss.split(X, y) to the cv param of cross_validation.cross_val_score.
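The size arithmetic behind both errors can be sketched in plain Python. This assumes, as I understand scikit-learn's behaviour, that a fractional test_size is rounded up to a whole number of rows; split_sizes is a hypothetical helper written for this illustration, not a library function.

```python
import math

n_samples = 6            # 3 labels x 2 sampled rows each
n_classes = 3
default_n_neighbors = 5  # KNeighborsClassifier default


def split_sizes(test_size, n_samples):
    """Approximate (n_train, n_test) for a fractional test_size (assumed rounding)."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


for test_size in (0.1, 0.35):
    n_train, n_test = split_sizes(test_size, n_samples)
    # A stratified test set needs at least one row per class
    test_ok = n_test >= n_classes
    # KNN needs at least n_neighbors training rows
    default_knn_ok = default_n_neighbors <= n_train
    small_knn_ok = 2 <= n_train
    print(test_size, n_train, n_test, test_ok, default_knn_ok, small_knn_ok)
```

With 0.1 the test set ends up with fewer rows than there are classes; with 0.35 the split is valid, but only 3 training rows remain, so n_neighbors has to come down from the default.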
So here is what you want your code to look like:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn import neighbors
# Note: sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn import cross_validation
df = pd.DataFrame(np.random.rand(12, 5))
label = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label
# Keep 2 random rows per label
df1 = pd.concat(g.sample(2) for idx, g in df.groupby('label'))
X = df1[[0, 1, 2, 3, 4]].values
y = df1.label.values
# n_neighbors must not exceed the number of training rows
clf = neighbors.KNeighborsClassifier(n_neighbors=2)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.35)
# Pass the train/test splits, not the splitter object itself
scoresSSS = cross_validation.cross_val_score(clf, X, y, cv=sss.split(X, y))
print(scoresSSS)
Let me just say that I have no idea what score you're looking to get, and by no means am I claiming that this will optimize your score. However, this will help you get rid of those errors so you can get back to work.
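For readers on scikit-learn 0.20 or later, where the sklearn.cross_validation module no longer exists, a minimal sketch of the same example might look like the following. It assumes cross_val_score is imported from sklearn.model_selection, which accepts the splitter object directly as cv; the random_state values are my addition for reproducibility and are not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)  # assumption: fixed seed for repeatable output
df = pd.DataFrame(rng.rand(12, 5))
df["label"] = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])

# Keep 2 random rows per label
df1 = pd.concat(g.sample(2, random_state=0) for _, g in df.groupby("label"))
X = df1[[0, 1, 2, 3, 4]].values
y = df1["label"].values

clf = KNeighborsClassifier(n_neighbors=2)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.35, random_state=0)

# model_selection.cross_val_score takes the splitter itself as cv
scores = cross_val_score(clf, X, y, cv=sss)
print(scores)
```

As with the answer's code, the score on six random rows is not meaningful; the point is only that the call runs without the errors discussed above.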