Scikit NaN or infinity error message


Question

I'm importing some data from a CSV file. The file has NaN values flagged with the text 'NA'. I import the data with:

X = genfromtxt(data, delimiter=',', dtype=float, skip_header=1)

I then use this code to replace each NaN with a previously calculated column mean:

inds = np.where(np.isnan(X))
X[inds]=np.take(col_mean,inds[1])
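
The question does not show how col_mean was computed. As a minimal sketch, assuming it is the per-column mean with NaNs ignored (np.nanmean), and using a small in-memory CSV as a hypothetical stand-in for the real file, the import and replacement steps might look like this:

```python
import io
import numpy as np

# Hypothetical stand-in for the CSV file; "NA" marks missing values.
csv_text = "a,b,c\n1.0,NA,3.0\n4.0,5.0,NA\n7.0,8.0,9.0\n"

# With dtype=float, fields that cannot be parsed (such as "NA") become nan.
X = np.genfromtxt(io.StringIO(csv_text), delimiter=",", dtype=float, skip_header=1)

# Assumption: col_mean is the mean of each column with NaNs ignored.
col_mean = np.nanmean(X, axis=0)

# Replace each NaN with the mean of its own column.
inds = np.where(np.isnan(X))
X[inds] = np.take(col_mean, inds[1])
```

Here inds[1] holds the column index of every NaN, so np.take picks the matching column mean for each missing entry.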

I then run a couple of checks and get empty arrays:

np.where(np.isnan(X))
np.where(np.isinf(X))

Finally, I run a scikit-learn classifier:

RF = ensemble.RandomForestClassifier(n_estimators=100,n_jobs=-1,verbose=2)
RF.fit(X, y)

and get the following error:

  File "C:\Users\m&g\Anaconda\lib\site-packages\sklearn\ensemble\forest.py", line 257, in fit
    check_ccontiguous=True)
  File "C:\Users\m&g\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 233, in check_arrays
    _assert_all_finite(array)
  File "C:\Users\m&g\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 27, in _assert_all_finite
    raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.

Any ideas why it is telling me that there are NaN or infinity values? I read this post and tried to run:

RF.fit(X.astype(float), y.astype(float))

but I get the same error.

Recommended answer

scikit-learn's decision trees cast their input to float32 for efficiency, but your values won't fit in that type:

>>> np.float32(8.9932064170227995e+41)
inf
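
This also explains why the isnan and isinf checks in the question come back empty: they run on the float64 array, where every value is still finite. A rough sketch of a check that mirrors the float32 cast, using a small hypothetical array that contains the value shown above:

```python
import numpy as np

# Hypothetical array holding the value from the answer alongside ordinary ones.
X = np.array([[1.0, 8.9932064170227995e+41],
              [2.0, 3.5]])

# In float64 every entry is finite, so isnan/isinf checks find nothing.
finite_as_float64 = np.isfinite(X).all()

# Entries whose magnitude exceeds the largest float32 become inf when cast.
float32_max = np.finfo(np.float32).max
rows, cols = np.where(np.abs(X) > float32_max)

# Casting shows the same problem directly; the overflow warning is silenced.
with np.errstate(over="ignore"):
    finite_as_float32 = np.isfinite(X.astype(np.float32)).all()
```

The rows and cols arrays give the positions of the offending entries, which can help decide whether the large values are genuine or a data error.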

The solution is to standardize the data with sklearn.preprocessing.StandardScaler before fitting the model. Don't forget to apply the same transform before predicting. You can use a sklearn.pipeline.Pipeline to combine standardization and classification in a single object:

rf = Pipeline([("scale", StandardScaler()),
               ("rf", RandomForestClassifier(n_estimators=100, n_jobs=-1, verbose=2))])

Or, with the current dev version/next release:

rf = make_pipeline(StandardScaler(),
                   RandomForestClassifier(n_estimators=100, n_jobs=-1, verbose=2))
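
For completeness, here is a self-contained sketch of the pipeline approach with the imports it needs. The data is synthetic and purely illustrative (one column is scaled up so that it would overflow float32), and the smaller n_estimators and the random_state are assumptions made to keep the example quick and repeatable:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the second column holds values too large for float32.
rng = np.random.RandomState(0)
X = rng.rand(40, 2)
X[:, 1] *= 1e42
y = (X[:, 0] > 0.5).astype(int)

rf = make_pipeline(StandardScaler(),
                   RandomForestClassifier(n_estimators=10, random_state=0))

# fit() standardizes X and then trains the forest on the scaled values;
# predict() applies the same scaling before classifying.
rf.fit(X, y)
pred = rf.predict(X)
```

Because the scaler lives inside the pipeline, there is no separate transform step to remember when predicting on new data.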

(I admit the error message could be improved.)
