NaNs suddenly appearing for sklearn KFolds

Question

I'm trying to run cross validation on my data set. The data appears to be clean, but then when I try to run it, some of my data gets replaced by NaNs. I'm not sure why. Has anybody seen this before?

y, X = np.ravel(df_test['labels']), df_test[['variation', 'length', 'tempo']]
X_train, X_test, y_train, y_test = cv.train_test_split(X,y,test_size=.30, random_state=4444)

This is what my X data looked like before KFolds:

   variation       length       tempo
0   0.005144  1183.148118  135.999178
1   0.002595   720.165442  117.453835
2   0.008146   397.500952  112.347147
3   0.005367  1109.819501  172.265625
4   0.001631   509.931973  135.999178
5   0.001620   560.365714  151.999081
6   0.002513   763.377778  107.666016
7   0.009262   502.083628   99.384014
8   0.000610   500.017052  143.554688
9   0.000733   269.001723  117.453835

My Y data looks like this:

array([ True, False, False, True, True, True, True, False, True, False], dtype=bool)

Now when I try to do the cross val:

kf = KFold(X_train.shape[0], n_folds=4, shuffle=True)

for train_index, val_index in kf:
    cv_train_x = X_train.ix[train_index]
    cv_val_x = X_train.ix[val_index]
    cv_train_y = y_train[train_index]
    cv_val_y = y_train[val_index]
    print cv_train_x

    logreg = LogisticRegression(C = .01)
    logreg.fit(cv_train_x, cv_train_y)
    pred = logreg.predict(cv_val_x)
    print accuracy_score(cv_val_y, pred)

When I try to run this, it fails with the error below, so I added the print statement.
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

This is what the print statement showed; some of the data had become NaNs:

   variation       length       tempo
0        NaN          NaN         NaN
1        NaN          NaN         NaN
2   0.008146   397.500952  112.347147
3   0.005367  1109.819501  172.265625
4   0.001631   509.931973  135.999178

I'm sure I'm doing something wrong, any ideas? As always, thank you so much!

Recommended answer

To solve this, index your pandas DataFrame with .iloc instead of .ix:

for train_index, val_index in kf:
    cv_train_x = X_train.iloc[train_index]
    cv_val_x = X_train.iloc[val_index]
    cv_train_y = y_train[train_index]
    cv_val_y = y_train[val_index]
    print cv_train_x

    logreg = LogisticRegression(C = .01)
    logreg.fit(cv_train_x, cv_train_y)
    pred = logreg.predict(cv_val_x)
    print accuracy_score(cv_val_y, pred)
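
The loop above is written against the older scikit-learn and Python 2 APIs used in the question. As a rough sketch of the same idea on current releases, assuming `KFold` from `sklearn.model_selection` with `n_splits` and `.split()`, and using synthetic stand-in data because the original data set is not available, it might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-in data (an assumption): 40 rows with the question's column names.
rng = np.random.RandomState(0)
df_test = pd.DataFrame({
    "variation": rng.uniform(0.0005, 0.01, 40),
    "length": rng.uniform(250, 1200, 40),
    "tempo": rng.uniform(95, 175, 40),
})
df_test["labels"] = df_test["tempo"] > df_test["tempo"].median()

X = df_test[["variation", "length", "tempo"]]
y = df_test["labels"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=4444
)

kf = KFold(n_splits=4, shuffle=True, random_state=0)
scores = []
for train_index, val_index in kf.split(X_train):
    # KFold yields row positions, so select DataFrame rows with .iloc.
    cv_train_x = X_train.iloc[train_index]
    cv_val_x = X_train.iloc[val_index]
    # y_train is a NumPy array, so plain positional indexing is correct here.
    cv_train_y = y_train[train_index]
    cv_val_y = y_train[val_index]

    # No row should have been turned into NaN by the selection.
    assert not cv_train_x.isna().any().any()

    logreg = LogisticRegression(C=0.01)
    logreg.fit(cv_train_x, cv_train_y)
    pred = logreg.predict(cv_val_x)
    scores.append(accuracy_score(cv_val_y, pred))

print(scores)
```

The accuracy values themselves are meaningless for this random data; the point is only that every fold is selected by position and contains no NaN rows.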

With an integer index, .ix behaves like .loc, which is label-based indexing, not position-based. KFold returns row positions (0 to n-1), and .ix treats them as labels. That works on X, whose integer labels happen to coincide with the row positions, but after the shuffled train/test split this correspondence no longer holds, and X_train looks something like:

        length       tempo  variation
4   509.931973  135.999178   0.001631
2   397.500952  112.347147   0.008146
7   502.083628   99.384014   0.009262
6   763.377778  107.666016   0.002513
5   560.365714  151.999081   0.001620
3  1109.819501  172.265625   0.005367
9   269.001723  117.453835   0.000733

Now you no longer have label 0 or 1, so if you do

X_train.loc[1]

you will get an exception:

KeyError: 'the label [1] is not in the [index]'

However, pandas fails silently if you request multiple labels and at least one of them exists. So if you do

 X_train.loc[[1,4]]

you will get

       length       tempo  variation
1         NaN         NaN        NaN
4  509.931973  135.999178   0.001631

As expected, label 1 returns NaNs (since it was not found), while label 4 returns the actual row, since it is in X_train. To solve this, either switch to .iloc or manually rebuild the index of X_train.
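
Both fixes can be tried on a small stand-in frame. The sketch below is illustrative only: the three rows are copied from the output above, and `reindex` is used to reproduce the NaN-filling behaviour, because recent pandas releases (1.0 and later) have removed .ix and make .loc raise a KeyError when any label in a list is missing.

```python
import pandas as pd

# Stand-in for X_train after a shuffled split: the labels are 4, 2, 7, not 0, 1, 2.
X_train = pd.DataFrame(
    {"length": [509.931973, 397.500952, 502.083628],
     "tempo": [135.999178, 112.347147, 99.384014]},
    index=[4, 2, 7],
)

# Positional lookup: the row at position 0 is the row labelled 4.
first_row = X_train.iloc[0]

# Label lookup with a missing label: reindex fills the row for label 1 with NaN,
# which mirrors what the old .ix / .loc list lookup did silently.
looked_up = X_train.reindex([1, 4])

# Alternative fix: rebuild the index so that labels equal positions again.
X_reset = X_train.reset_index(drop=True)
labels_match_positions = X_reset.loc[[0, 2]].equals(X_reset.iloc[[0, 2]])

print(looked_up)
print(labels_match_positions)
```

If the index is rebuilt instead of switching to .iloc, y_train needs no change as long as it is a NumPy array, as in the question, since it is already indexed by position.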
