RandomForestClassifier Error: Number of features must match input

Question

I'm relatively new to sklearn and have been trying to make use of the following code:

from sklearn.ensemble import RandomForestClassifier
from numpy import genfromtxt, savetxt

def main():
    #create the training & test sets, skipping the header row with [1:]
    dataset = genfromtxt(open('mypath\data1.csv','r'), delimiter=',', dtype='f8')[1:]    
    target = [x[0] for x in dataset]
    train = [x[1:] for x in dataset]
    test = genfromtxt(open('mypath\data1.csv','r'), delimiter=',', dtype='f8')[1:]

    #create and train the random forest
    #multi-core CPUs can use: rf = RandomForestClassifier(n_estimators=100, n_jobs=2)
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(train, target)

    savetxt('myoutput\data1_output.csv', rf.predict(test), delimiter=',', fmt='%f')

if __name__=="__main__":
    main()

This code runs a random forest classifier on a .csv file containing three columns, the first of which contains labels while the other two contain features. When running this program I get the following error:

ValueError: Number of features of the model must match the input. Model n_features is 2 and input n_features is 3

My initial assumption was that there was a component named n_features that I would need to adjust to my use case. However, it appears to be more complex than this. Would anyone be able to explain if and how I could get a .csv of the type I described above to run with this code successfully?

I did see this post, which suggests the issue is that the code is including my labels as a feature. However, I don't really understand how the solution presented to that problem solves this one and so would greatly appreciate additional explanation.

Recommended answer

The shape of your csv file is (n_examples, 3). You split this array into two lists containing the response variables and input variables when you call:

target = [x[0] for x in dataset]
train = [x[1:] for x in dataset]

Thus, target has shape (n_examples,) and train has shape (n_examples, 2). Next, you read the same csv file into test (I don't know why you're testing on the training data, or why the file needs to be read a second time). Either way, test keeps all three columns, so its shape is (n_examples, 3).
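The shapes can be checked directly with NumPy. The snippet below is a minimal sketch: the small `dataset` array is a made-up stand-in for the parsed csv (label in column 0, two features after it), not the asker's real data. Note that `target`, being a flat list of labels, reports as one-dimensional.

```python
import numpy as np

# Hypothetical stand-in for the parsed csv file:
# column 0 holds the label, columns 1 and 2 hold the features.
dataset = np.array([
    [0.0, 1.5, 2.5],
    [1.0, 3.5, 4.5],
    [0.0, 5.5, 6.5],
    [1.0, 7.5, 8.5],
])

# Same split as in the question.
target = [x[0] for x in dataset]
train = [x[1:] for x in dataset]
test = dataset  # read from the same source, label column still attached

target_shape = np.shape(target)
train_shape = np.shape(train)
test_shape = np.shape(test)

# train has 2 columns, test has 3: this is the mismatch in the error.
print(target_shape, train_shape, test_shape)
```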

predict takes inputs and produces responses using the model parameters learned by fit. Because the model was fitted on two features, predict expects an array of shape (n_examples, 2); in recent scikit-learn versions a single example of shape (2,) has to be reshaped to (1, 2) first. test has three columns, and that is where the mismatch comes from.

To fix, call rf.predict(test[:, 1:]). This slice keeps every row and takes everything from column 1 onwards, skipping the first column, which holds the response variable for each example. The header row was already dropped by the [1:] applied when the file was loaded (you should check that the header really was read in as the first row), so slicing the rows again with test[1:, 1:] would also discard the first data row.
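The following is a small sketch of the mismatch and of the column slice that resolves it. It assumes synthetic data in place of the csv file (the random features and the labelling rule are invented for illustration), with the label in column 0 as in the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the csv: label in column 0, two features after it.
rng = np.random.default_rng(0)
features = rng.normal(size=(40, 2))
labels = (features[:, 0] > 0).astype(float)
dataset = np.column_stack([labels, features])

target = dataset[:, 0]   # labels only
train = dataset[:, 1:]   # the two feature columns
test = dataset           # still has all three columns

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(train, target)

# Passing the 3-column array reproduces the feature-count error.
try:
    rf.predict(test)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# Dropping the label column gives predict the 2 features it was fitted on.
predictions = rf.predict(test[:, 1:])
print(mismatch_raised, predictions.shape)
```

Slicing with NumPy indexing (`dataset[:, 1:]`) rather than a list comprehension keeps everything as a 2-D array, which makes the column counts easy to compare.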

Of course, since test was read from the same file as your training data, this is equivalent to rf.predict(train).
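Predicting on the rows the model was trained on says little about how it handles new data. One possible alternative, sketched here under the assumption that a held-out split is acceptable for your use case, is to set aside part of the file with `train_test_split`. The data is again synthetic, and the 25% split size is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the csv: label in column 0, two features after it.
rng = np.random.default_rng(0)
features = rng.normal(size=(40, 2))
labels = (features[:, 0] > 0).astype(float)
dataset = np.column_stack([labels, features])

X = dataset[:, 1:]  # features
y = dataset[:, 0]   # labels

# Hold out a quarter of the rows; both halves keep the same 2 feature columns.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X_train, y_train)

held_out_predictions = rf.predict(X_test)
print(X_test.shape, held_out_predictions.shape)
```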
