Scikit-learn: Input contains NaN, infinity or a value too large for dtype('float64')

Views: 31 Published: 2021/12/25 14:54:20 python numpy machine-learning scikit-learn


Question

I'm using Python scikit-learn to run a simple linear regression on data read from a CSV file.

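import numpy as np
import pandas
from sklearn import linear_model
from sklearn.model_selection import train_test_split

reader = pandas.read_csv("data/all-stocks-cleaned.csv")
stock = np.array(reader)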

openingPrice = stock[:, 1]
closingPrice = stock[:, 5]

print((np.min(openingPrice)))
print((np.min(closingPrice)))
print((np.max(openingPrice)))
print((np.max(closingPrice)))

openingPriceTrain, openingPriceTest, closingPriceTrain, closingPriceTest = train_test_split(
    openingPrice, closingPrice, test_size=0.25, random_state=42)


openingPriceTrain = np.reshape(openingPriceTrain,(openingPriceTrain.size,1))

openingPriceTrain = openingPriceTrain.astype(np.float64, copy=False)
# openingPriceTrain = np.arange(openingPriceTrain, dtype=np.float64)

closingPriceTrain = np.reshape(closingPriceTrain,(closingPriceTrain.size,1))
closingPriceTrain = closingPriceTrain.astype(np.float64, copy=False)

openingPriceTest = np.reshape(openingPriceTest,(openingPriceTest.size,1))
closingPriceTest = np.reshape(closingPriceTest,(closingPriceTest.size,1))

regression = linear_model.LinearRegression()

regression.fit(openingPriceTrain, closingPriceTrain)

predicted = regression.predict(openingPriceTest)

The min and max values are shown as 0.0, 0.6, 41998.0 and 2593.9.

Yet I'm getting this error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

How should I get rid of this error? Judging from the output above, the data does not seem to contain infinities or NaN values.

What's the solution for this?

all-stocks-cleaned.csv is available at http://www.sharecsv.com/s/cb31790afc9b9e33c5919cdc562630f3/all-stocks-cleaned.csv

Recommended answer

The problem with your regression is that NaNs have somehow sneaked into your data. This can be checked easily with the following code snippet:

import pandas as pd
import numpy as np
from sklearn import linear_model
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

reader = pd.read_csv("./data/all-stocks-cleaned.csv")
stock = np.array(reader)

openingPrice = stock[:, 1]
closingPrice = stock[:, 5]

openingPriceTrain, openingPriceTest, closingPriceTrain, closingPriceTest = train_test_split(
    openingPrice, closingPrice, test_size=0.25, random_state=42)

openingPriceTrain = openingPriceTrain.reshape(openingPriceTrain.size,1)
openingPriceTrain = openingPriceTrain.astype(np.float64, copy=False)

closingPriceTrain = closingPriceTrain.reshape(closingPriceTrain.size,1)
closingPriceTrain = closingPriceTrain.astype(np.float64, copy=False)

openingPriceTest = openingPriceTest.reshape(openingPriceTest.size,1)
openingPriceTest = openingPriceTest.astype(np.float64, copy=False)

np.isnan(openingPriceTrain).any(), np.isnan(closingPriceTrain).any(), np.isnan(openingPriceTest).any()

(True, True, True)
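The check above only looks for NaN, while the error message also mentions infinity. A minimal sketch of a broader check follows; the helper name `describe_non_finite` and the sample values are made up for illustration and are not from the original post. A likely reason the min/max printout in the question looked clean is that `np.array(reader)` on a frame with a text Date column produces an object array, where min and max may not propagate NaN the way they do on float64 arrays; casting to float64 first avoids that ambiguity.

```python
import numpy as np

def describe_non_finite(arr):
    """Count NaN and +/-inf entries after casting the input to float64."""
    values = np.asarray(arr, dtype=np.float64)
    return {
        "nan": int(np.isnan(values).sum()),
        "inf": int(np.isinf(values).sum()),
        "all_finite": bool(np.isfinite(values).all()),
    }

# Made-up prices standing in for one column of the CSV (object dtype on purpose)
sample = np.array([10.5, np.nan, 12.0, np.inf], dtype=object)
report = describe_non_finite(sample)
print(report)
```

If `all_finite` is False for any array passed to `fit` or `predict`, scikit-learn can be expected to raise the ValueError shown in the question.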

If you impute the missing values, for example with the median of the non-missing entries, as below:

openingPriceTrain[np.isnan(openingPriceTrain)] = np.median(openingPriceTrain[~np.isnan(openingPriceTrain)])
closingPriceTrain[np.isnan(closingPriceTrain)] = np.median(closingPriceTrain[~np.isnan(closingPriceTrain)])
openingPriceTest[np.isnan(openingPriceTest)] = np.median(openingPriceTest[~np.isnan(openingPriceTest)])

your regression will run without a problem:

regression = linear_model.LinearRegression()

regression.fit(openingPriceTrain, closingPriceTrain)

predicted = regression.predict(openingPriceTest)

predicted[:5]

array([[ 13598.74748173],
       [ 53281.04442146],
       [ 18305.4272186 ],
       [ 50753.50958453],
       [ 14937.65782778]])
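As an alternative to the manual median fill, scikit-learn ships `SimpleImputer` in `sklearn.impute`. The sketch below is not part of the original answer; it uses made-up toy arrays and assumes you want the median learned on the training split reused on the test split, which differs from the snippet above, where the test set is filled with its own median.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Toy data (made up): one feature column with missing entries
X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])
X_test = np.array([[np.nan], [3.0]])

# Learn the median from the training features only, then reuse it for the test set
imputer = SimpleImputer(strategy="median")
X_train_filled = imputer.fit_transform(X_train)
X_test_filled = imputer.transform(X_test)

model = LinearRegression().fit(X_train_filled, y_train)
predicted = model.predict(X_test_filled)
print(predicted)
```

Reusing the training median keeps information from the test set out of the preprocessing step. Rows whose target value is missing are usually dropped rather than imputed, but that choice depends on the data.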

In short: you have missing values in your data, as the error message said.

Perhaps an easier and more straightforward approach is to check for missing data right after reading the file with pandas:

data = pd.read_csv('./data/all-stocks-cleaned.csv')
data.isnull().any()
Date                    False
Open                     True
High                     True
Low                      True
Last                     True
Close                    True
Total Trade Quantity     True
Turnover (Lacs)          True
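Beyond a per-column True/False, it can help to see how many values are missing and which rows are affected. A small sketch on a made-up frame (the column names mirror the CSV, the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the stock data
data = pd.DataFrame({
    "Date": ["2015-01-01", "2015-01-02", "2015-01-05"],
    "Open": [100.0, np.nan, 102.5],
    "Close": [101.0, 103.0, np.nan],
})

# Number of missing values in each column
missing_per_column = data.isnull().sum()
# Rows that contain at least one missing value
rows_with_missing = data[data.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

Inspecting the affected rows first makes it easier to decide whether imputing or dropping them is more appropriate.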

and then impute the data with either of the two lines below:

# fill each numeric column with its own median
data = data.fillna(data.median(numeric_only=True))

or

# carry the last valid value forward
data = data.ffill()
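To make the difference between these strategies concrete, here is a sketch on a made-up frame, assuming the intent of the first option is to fill each numeric column with its own median. Dropping incomplete rows with `dropna` is shown as a third option that the original answer does not mention.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the stock data
data = pd.DataFrame({
    "Date": ["2015-01-01", "2015-01-02", "2015-01-05", "2015-01-06"],
    "Open": [100.0, np.nan, 104.0, 102.0],
})

# Median of the observed Open values (100, 104, 102) is 102
median_filled = data.fillna(data.median(numeric_only=True))
# Forward fill copies the previous row's value (100) into the gap
forward_filled = data.ffill()
# Alternatively, drop every row that has a missing value
dropped = data.dropna()

print(median_filled)
print(forward_filled)
print(dropped)
```

For time-ordered prices, forward fill keeps each value tied to the most recent observation, while the median ignores ordering. Which is preferable depends on how the data will be used.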
