NaNs suddenly appearing for sklearn KFolds
Question
I'm trying to run cross validation on my data set. The data appears to be clean, but then when I try to run it, some of my data gets replaced by NaNs. I'm not sure why. Has anybody seen this before?
y, X = np.ravel(df_test['labels']), df_test[['variation', 'length', 'tempo']]
X_train, X_test, y_train, y_test = cv.train_test_split(X,y,test_size=.30, random_state=4444)
This is what my X data looked like before KFolds:
   variation       length       tempo
0   0.005144  1183.148118  135.999178
1   0.002595   720.165442  117.453835
2   0.008146   397.500952  112.347147
3   0.005367  1109.819501  172.265625
4   0.001631   509.931973  135.999178
5   0.001620   560.365714  151.999081
6   0.002513   763.377778  107.666016
7   0.009262   502.083628   99.384014
8   0.000610   500.017052  143.554688
9   0.000733   269.001723  117.453835
My Y data looks like this:
array([ True, False, False, True, True, True, True, False, True, False], dtype=bool)
Now when I try to do the cross val:
kf = KFold(X_train.shape[0], n_folds=4, shuffle=True)
for train_index, val_index in kf:
    cv_train_x = X_train.ix[train_index]
    cv_val_x = X_train.ix[val_index]
    cv_train_y = y_train[train_index]
    cv_val_y = y_train[val_index]
    print cv_train_x
    logreg = LogisticRegression(C = .01)
    logreg.fit(cv_train_x, cv_train_y)
    pred = logreg.predict(cv_val_x)
    print accuracy_score(cv_val_y, pred)
When I run this, it fails with the error below, so I added the print statement.
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
This is what the print statement printed; some of the data became NaNs:
   variation       length       tempo
0        NaN          NaN         NaN
1        NaN          NaN         NaN
2   0.008146   397.500952  112.347147
3   0.005367  1109.819501  172.265625
4   0.001631   509.931973  135.999178
I'm sure I'm doing something wrong, any ideas? As always, thank you so much!
Recommended answer
To solve this, use .iloc instead of .ix to index your pandas DataFrame:
for train_index, val_index in kf:
    cv_train_x = X_train.iloc[train_index]
    cv_val_x = X_train.iloc[val_index]
    cv_train_y = y_train[train_index]
    cv_val_y = y_train[val_index]
    print cv_train_x
    logreg = LogisticRegression(C = .01)
    logreg.fit(cv_train_x, cv_train_y)
    pred = logreg.predict(cv_val_x)
    print accuracy_score(cv_val_y, pred)
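For readers on Python 3 and a recent scikit-learn, the same loop can be sketched roughly as below. This is an illustration rather than the original poster's code: it assumes the `sklearn.model_selection` API (`KFold(n_splits=...)` and `kf.split(...)`), and the data is a made-up synthetic stand-in with the same three column names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-in for the question's data (values are made up).
rng = np.random.default_rng(0)
n_rows = 40
X = pd.DataFrame({
    "variation": rng.uniform(0.0005, 0.01, n_rows),
    "length": rng.uniform(250, 1200, n_rows),
    "tempo": rng.uniform(95, 175, n_rows),
})
y = np.arange(n_rows) % 2 == 0  # alternating True/False labels

# After this split, X_train keeps a shuffled subset of the original labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=4444
)

kf = KFold(n_splits=4, shuffle=True, random_state=0)
scores = []
for train_pos, val_pos in kf.split(X_train):
    # KFold yields integer positions, so index the DataFrame with .iloc.
    cv_train_x = X_train.iloc[train_pos]
    cv_val_x = X_train.iloc[val_pos]
    # y_train is a NumPy array here, so plain [] is already positional.
    cv_train_y = y_train[train_pos]
    cv_val_y = y_train[val_pos]

    logreg = LogisticRegression(C=0.01, max_iter=1000)
    logreg.fit(cv_train_x, cv_train_y)
    scores.append(accuracy_score(cv_val_y, logreg.predict(cv_val_x)))

print(scores)
```

The key point is unchanged from the answer: the fold indices are positions, so they belong with `.iloc`, not with a label-based indexer.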
Indexing with .ix is usually equivalent to using .loc, which is label-based indexing, not position-based. This works on X, which has a clean integer index where each label equals the row's position, but after the train/test split that is no longer true of X_train, and you get something like:
        length       tempo  variation
4   509.931973  135.999178   0.001631
2   397.500952  112.347147   0.008146
7   502.083628   99.384014   0.009262
6   763.377778  107.666016   0.002513
5   560.365714  151.999081   0.001620
3  1109.819501  172.265625   0.005367
9   269.001723  117.453835   0.000733
and now you no longer have label 0 or 1, so if you do
X_train.loc[1]
you will get an exception:
KeyError: 'the label [1] is not in the [index]'
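The difference between label-based and position-based lookup can be reproduced with a small sketch. The frame below is a hypothetical three-row subset of the rows shown above, built by hand so that its index is [4, 2, 7]; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical subset of X_train: rows keep their original labels 4, 2, 7.
X_train = pd.DataFrame(
    {
        "length": [509.931973, 397.500952, 502.083628],
        "tempo": [135.999178, 112.347147, 99.384014],
        "variation": [0.001631, 0.008146, 0.009262],
    },
    index=[4, 2, 7],
)

first_by_position = X_train.iloc[0]  # row at position 0, whose label is 4
same_row_by_label = X_train.loc[4]   # the same row, looked up by its label

try:
    X_train.loc[1]                   # label 1 did not survive the split
    missing_label_raised = False
except KeyError:
    missing_label_raised = True

print(first_by_position.name, missing_label_raised)
```

Position 0 and label 4 refer to the same row here, while label 1 does not exist at all, which is why a single-label lookup raises `KeyError`.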
However, pandas fails silently if you request multiple labels and at least one of them exists. Thus if you do
X_train.loc[[1,4]]
you will get
       length       tempo  variation
1         NaN         NaN        NaN
4  509.931973  135.999178   0.001631
As expected, 1 returns NaNs (since it was not found) and 4 returns the actual row, since it is in X_train. To fix this, just switch to .iloc or manually rebuild the index of X_train.
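The second option, rebuilding the index, might look like the sketch below. It assumes a reasonably recent pandas and reuses a hypothetical three-row subset of the rows above. Note that in pandas 1.0 and later `.ix` has been removed and `.loc` with a list containing a missing label raises `KeyError` instead of returning NaN rows; `reindex` is used here only to reproduce the NaN-filling behavior described in this answer.

```python
import pandas as pd

# Hypothetical subset of X_train with the shuffled labels left by the split.
X_train = pd.DataFrame(
    {
        "length": [509.931973, 397.500952, 502.083628],
        "tempo": [135.999178, 112.347147, 99.384014],
        "variation": [0.001631, 0.008146, 0.009262],
    },
    index=[4, 2, 7],
)

# reindex fills rows for missing labels with NaN, mirroring the old lookup.
looked_up = X_train.reindex([1, 4])

# Rebuild the index so that labels 0..n-1 coincide with positions again.
X_train_reset = X_train.reset_index(drop=True)
rows = X_train_reset.loc[[0, 1]]  # label-based lookup is now safe

print(looked_up)
print(rows)
```

After `reset_index(drop=True)`, the positions produced by KFold are also valid labels, so label-based and position-based indexing agree. This stays aligned with `y_train` as long as `y_train` is a NumPy array, as it is in the question.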