sklearn Logistic Regression ValueError: X has 42 features per sample; expecting 1423

Problem description

I'm stuck trying to fix an issue. Here is what I'm trying to do:

I'd like to predict missing values (NaN) in a categorical feature using logistic regression. Here is my code:

df_1: my dataset, with missing values only in the "Metier" feature (the values I'm trying to predict)

X_train = pd.get_dummies(df_1[df_1['Metier'].notnull()].drop(columns='Metier'),drop_first = True)
X_test = pd.get_dummies(df_1[df_1['Metier'].isnull()].drop(columns='Metier'),drop_first = True,dummy_na = True)

Y_train = df_1[df_1['Metier'].notnull()]['Metier']
Y_test = df_1[df_1['Metier'].isnull()]['Metier']

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)

classifier.fit(X_train, Y_train)

classifier.score(X_train, Y_train)  # returns 0.705112088833019

But when I try to get the predictions for Y_test, it says:

ValueError: X has 42 features per sample; expecting 1423

I would highly appreciate it if someone could give me a hand.

Thank you very much :)

Recommended answer

Rule of thumb: never call pandas.get_dummies separately on multiple dataframes. It derives the columns from the categories present in each input, so it does not guarantee that the results have the same dimensions.

import pandas as pd

print(pd.get_dummies(['a', 'b', 'c']))
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1

print(pd.get_dummies(['b', 'c']))
   b  c
0  1  0
1  0  1
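
If the two frames have already been encoded separately, one possible workaround (not part of the original answer) is to align the test columns to the training columns with DataFrame.reindex. The sketch below reuses the toy inputs above; the names train, test and test_aligned are made up for illustration, and dtype=int is passed only so the output is 0/1 regardless of pandas version.

```python
import pandas as pd

# Encode the two inputs separately, as in the example above.
train = pd.get_dummies(pd.Series(["a", "b", "c"]), dtype=int)
test = pd.get_dummies(pd.Series(["b", "c"]), dtype=int)

# The column counts differ, which is what triggers the ValueError at predict time.
print(train.shape[1], test.shape[1])

# Align the test frame to the training columns: categories missing from the
# test data become all-zero columns, and columns seen only in the test data
# would be dropped.
test_aligned = test.reindex(columns=train.columns, fill_value=0)
print(test_aligned)
```

This only patches the symptom; categories that appear only in the test data are silently discarded, so encoding once or using a fitted encoder is usually the cleaner route.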

It is only safe if you call pandas.get_dummies first and then split the result into x_train and x_test. Alternatively, you can use sklearn.preprocessing.OneHotEncoder:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse=False)  # in scikit-learn >= 1.2 this parameter is named sparse_output

ohe.fit_transform(np.reshape(['a', 'b', 'c'], (-1, 1)))

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

ohe.transform(np.reshape(['b', 'c'], (-1, 1)))  # it's transform, NOT fit_transform
array([[0., 1., 0.],
       [0., 0., 1.]])

Notice that the two different inputs now produce the same number of columns, because the encoder reuses the categories it learned during fit.
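
Applied to the setup in the question, the first option (encode once, then split) might look like the sketch below. The toy data, the "Ville" and "Age" columns, and max_iter=1000 are assumptions made for illustration; only df_1, "Metier" and the use of LogisticRegression come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for df_1; only "Metier" has missing values.
df_1 = pd.DataFrame({
    "Ville": ["Paris", "Lyon", "Paris", "Nice", "Lyon", "Nice"],
    "Age": [25, 40, 31, 52, 38, 29],
    "Metier": ["Cadre", "Ouvrier", "Cadre", "Ouvrier", np.nan, np.nan],
})

# Encode the features once, on the full frame, so every row shares the same columns.
X_all = pd.get_dummies(df_1.drop(columns="Metier"), drop_first=True, dtype=int)

# Split afterwards: rows with a known "Metier" train the model,
# rows with a missing "Metier" are the ones to predict.
known = df_1["Metier"].notnull()
X_train, Y_train = X_all[known], df_1.loc[known, "Metier"]
X_test = X_all[~known]

classifier = LogisticRegression(random_state=0, max_iter=1000)
classifier.fit(X_train, Y_train)

# X_test has the same columns as X_train, so predict no longer raises the ValueError.
predicted = classifier.predict(X_test)
print(predicted)
```

Because Y_test is entirely NaN by construction, there is nothing to score the predictions against; holding out some rows with a known "Metier" would be needed to estimate accuracy.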
