Using scikit-learn (sklearn), how to handle missing data for linear regression?
Question
I tried this but couldn't get it to work for my data: Use Scikit Learn to do linear regression on a time series pandas data frame
My data consists of two DataFrames: DataFrame_1.shape = (40, 5000) and DataFrame_2.shape = (40, 74). I'm trying to do some type of linear regression, but DataFrame_2 contains NaN (missing) values. When I call DataFrame_2.dropna(how="any"), the shape drops to (2, 74).
Is there any linear regression algorithm in sklearn that can handle NaN values?
I'm modeling it after load_boston from sklearn.datasets, where X, y = boston.data, boston.target have shapes (506, 13) and (506,).
Here is my simplified code:
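from sklearn.linear_model import LinearRegression

X = DataFrame_1
# Fit one model per column of DataFrame_2, using that column as the target
for col in DataFrame_2.columns:
    y = DataFrame_2[col]
    model = LinearRegression()
    model.fit(X, y)
    # ValueError: Input contains NaN, infinity or a value too large for dtype('float64').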
I used the format above so that the shapes of the matrices match up.
If posting DataFrame_2 would help, please comment below and I'll add it.
Recommended answer
You can fill in the null values in y with imputation. In scikit-learn this is done with the following code snippet:
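# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")  # mean imputation was also Imputer's default
# The imputer expects 2-D input, so turn the Series into a one-column frame first
y_imputed = imputer.fit_transform(y.to_frame()).ravel()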
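To show how that imputation might slot into the loop from the question, here is a minimal sketch. It assumes a recent scikit-learn, where SimpleImputer from sklearn.impute takes the place of Imputer. The random data, the seed, and the names df_x, df_y and models are made up for illustration. Only the (40, 74) target shape comes from the question, and the feature matrix is narrowed to 50 columns to keep the example quick.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for DataFrame_1 and DataFrame_2 (hypothetical data).
df_x = pd.DataFrame(rng.normal(size=(40, 50)),
                    columns=[f"x{i}" for i in range(50)])
df_y = pd.DataFrame(rng.normal(size=(40, 74)),
                    columns=[f"y{i}" for i in range(74)])
# Blank out roughly 10% of the target values to mimic the missing data.
df_y = df_y.mask(rng.random(df_y.shape) < 0.1)

# Mean-impute every target column in one call; the result contains no NaN.
imputer = SimpleImputer(strategy="mean")
y_imputed = pd.DataFrame(imputer.fit_transform(df_y), columns=df_y.columns)

# Fit one model per target column, as in the question.
models = {}
for col in y_imputed.columns:
    models[col] = LinearRegression().fit(df_x, y_imputed[col])
```

Mean imputation is only one possible strategy. SimpleImputer also accepts strategy="median" or "most_frequent", and which one is appropriate depends on the data.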
Otherwise, you might want to build your model using a subset of the 74 columns as predictors; perhaps some of your columns contain fewer null values?
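One way to pick such a subset is sketched below. The 10% threshold, the synthetic data, and the names df_x, df_y, keep and subset are all assumptions for illustration, not part of the original answer. The idea is to measure the fraction of NaN per column, keep only the sparsely missing columns, and then drop incomplete rows from that subset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical data: 10 columns are mostly missing, one has a single NaN.
df_x = pd.DataFrame(rng.normal(size=(40, 5)),
                    columns=[f"x{i}" for i in range(5)])
df_y = pd.DataFrame(rng.normal(size=(40, 74)),
                    columns=[f"y{i}" for i in range(74)])
df_y.iloc[:30, :10] = np.nan   # columns y0..y9: 75% missing
df_y.iloc[0, 10] = np.nan      # column y10: one missing value

# Dropping rows across all 74 columns discards most of the data.
print(df_y.dropna(how="any").shape)   # (10, 74)

# Keep only the columns whose missing fraction is at most 10%.
null_fraction = df_y.isna().mean()
keep = null_fraction[null_fraction <= 0.1].index
subset = df_y[keep].dropna(how="any")
print(subset.shape)                   # (39, 64)

# Align the other frame on the surviving rows before fitting.
model = LinearRegression().fit(df_x.loc[subset.index], subset)
```

LinearRegression accepts a 2-D target, so a single fit covers all retained columns here. A per-column loop as in the question would work just as well.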