ValueError in scikit-learn: number of features of the model doesn't match the input

Question

I am pretty new to machine learning in general and to scikit-learn in particular.

I am trying to use the example given on the website http://scikit-learn.org/stable/tutorial/basic/tutorial.html

To practice on my own, I am using my own data set. It is divided into two different CSV files:

Train_data.csv (contains 32 columns; the last column is the output value).

Test_data.csv (contains 31 columns; the output column is missing, which should be the case, no?)

The test data has one column fewer than the training data.

I am using the following code to learn (using the training data) and then predict (using the test data).

The problem I am facing is this error:

*ValueError: X.shape[1] = 31 should be equal to 29, the number of features at training time*

Here is my code (sorry if it looks completely wrong :( )

import pandas as pd #import the library
from sklearn import svm 

mydata = pd.read_csv("Train - Copy.csv") #I read my training data set
target = mydata["Desired"]  #my csv has header row, and the output label column is named "Desired"
data = mydata.ix[:,:-3] #select all but the last column as data


clf = svm.SVC(gamma=0.001, C=100.) #Code from the URL above
clf.fit(data,target)  #Code from the URL above 

test_data = pd.read_csv("test.csv") #I read my test data set. Without the output column 

clf.predict(test_data[-1:]) #Code from the URL above

The header row of the training data CSV looks something like this:

Value1,Value2,Value3,Value4,Output

The header row of the test data CSV looks something like this:

Value1,Value2,Value3,Value4

Thanks :)

Recommended answer

Your problem is a supervised learning problem: you have data in the form of (input, output) pairs.

The inputs are the features describing each example, and the output is the prediction your model should produce for that input.

Your training data has one more attribute in its CSV file because, in order to train the model, you need to give it the output.

The general workflow in sklearn for a supervised problem should look like this:

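X, Y = read_data(data)  # placeholder: load the features X and the labels Y
n = len(X)
split = int(n * 0.8)  # slice indices must be integers, so convert explicitly
X_train, X_test = X[:split], X[split:]
Y_train, Y_test = Y[:split], Y[split:]

model.fit(X_train, Y_train)  # model is any sklearn estimator, e.g. svm.SVC()
model.score(X_test, Y_test)  # evaluate on the held-out 20%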

To split your data, you can use train_test_split, and you can use several metrics to judge your model's performance.
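As a rough illustration of that split-and-score step, here is a minimal sketch. It assumes a synthetic data set from make_classification standing in for your CSV (31 feature columns, to mirror your test file), and it picks accuracy_score as one possible metric; neither choice comes from your code.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: synthetic stand-in for a real data set (200 rows, 31 features).
X, y = make_classification(n_samples=200, n_features=31, random_state=0)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = SVC(gamma=0.001, C=100.0)
model.fit(X_train, y_train)

# Both splits keep the same number of feature columns.
print(X_train.shape, X_test.shape)

accuracy = accuracy_score(y_test, model.predict(X_test))
print("accuracy:", accuracy)
```

Because train_test_split only divides rows, the training and test parts always end up with the same number of columns, which is exactly what predict requires.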

You should check the shape of your data:

data.shape

It seems you are dropping the last 3 columns instead of only the last one, which leaves 29 features at training time while the test data has 31. Try instead:

data = mydata.iloc[:, :-1]  # all columns except the last; .ix was removed in pandas 1.0, .iloc selects by position
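Putting the pieces together, a sketch of the corrected flow might look like the following. The two DataFrames are made-up stand-ins for your CSV files (the column names Value1 to Value4 and the random values are assumptions for illustration), and dropping the label column by name is just one alternative to positional slicing.

```python
import numpy as np
import pandas as pd
from sklearn import svm

rng = np.random.default_rng(0)
feature_names = ["Value1", "Value2", "Value3", "Value4"]

# Assumption: small random frames standing in for the two CSV files.
mydata = pd.DataFrame(rng.normal(size=(20, 4)), columns=feature_names)
mydata["Desired"] = [0, 1] * 10  # label column, last in the frame
test_data = pd.DataFrame(rng.normal(size=(5, 4)), columns=feature_names)

target = mydata["Desired"]
data = mydata.drop(columns="Desired")  # every column except the label

# Training and test features must have the same number of columns.
assert data.shape[1] == test_data.shape[1], (data.shape, test_data.shape)

clf = svm.SVC(gamma=0.001, C=100.0)
clf.fit(data, target)

predictions = clf.predict(test_data)  # one prediction per test row
print(predictions.shape)
```

Selecting the features by dropping the named label column avoids off-by-N slicing mistakes, and the shape assertion fails early with both shapes in the message if the two files ever disagree.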
