RandomForestClassifier.fit(): ValueError: could not convert string to float

Problem description

Given a simple CSV file:

A,B,C
Hello,Hi,0
Hola,Bueno,1

Obviously the real dataset is far more complex than this, but this one reproduces the error. I'm attempting to build a random forest classifier for it, like so:

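import pandas as pd
from sklearn.ensemble import RandomForestClassifier

cols = ['A','B','C']
col_types = {'A': str, 'B': str, 'C': int}
test = pd.read_csv('test.csv', dtype=col_types)

train_y = test['C'] == 1
train_x = test[cols]  # still contains the raw string columns A and B

clf_rf = RandomForestClassifier(n_estimators=50)
clf_rf.fit(train_x, train_y)  # this call raises the ValueError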

But I just get this traceback when invoking fit():

ValueError: could not convert string to float: 'Bueno'

The scikit-learn version is 0.16.1.

Recommended answer

You have to encode the string columns before calling fit(). As noted, fit() does not accept strings, but you can work around this.

There are several classes that can be used:

  • LabelEncoder: maps each distinct string to an incremental integer value
  • OneHotEncoder: uses the one-of-K scheme to turn each distinct string into its own 0/1 column
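
To make the difference between the two encodings concrete, here is a minimal plain-Python sketch of what each one produces. It is only an illustration, not scikit-learn's implementation; the sample values and the variable names are made up for this example.

```python
# Plain-Python illustration of the two encodings (not scikit-learn's actual code).
values = ["Hi", "Bueno", "Hi"]  # hypothetical sample column

# Label encoding: each distinct string gets an integer index (sorted order here).
categories = sorted(set(values))
label_index = {cat: i for i, cat in enumerate(categories)}
label_encoded = [label_index[v] for v in values]

# One-of-K (one-hot) encoding: one 0/1 column per distinct string.
one_hot_encoded = [[1 if v == cat else 0 for cat in categories] for v in values]

print(label_encoded)    # [1, 0, 1]
print(one_hot_encoded)  # [[0, 1], [1, 0], [0, 1]]
```

Label encoding keeps a single column but implies an ordering between the strings, while one-of-K avoids that ordering at the cost of one column per distinct value.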

Personally, I posted almost the same question on StackOverflow some time ago. I wanted a scalable solution but did not get any answers. I settled on OneHotEncoder, which binarizes all the strings. It is quite effective, but if you have many distinct strings the matrix grows very quickly and needs a lot of memory.
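
As a rough sketch of the OneHotEncoder route applied to the CSV from the question: this assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string columns directly (in 0.16.1, the version from the question, it expected integer input). The inline CSV text, the choice to leave column C out of the features, and the random_state value are assumptions made for this example.

```python
import io

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Inline copy of the CSV from the question, so the sketch is self-contained.
csv_text = "A,B,C\nHello,Hi,0\nHola,Bueno,1\n"
data = pd.read_csv(io.StringIO(csv_text), dtype={"A": str, "B": str, "C": int})

train_y = data["C"] == 1

# Assumption: only the string columns are used as features; the target C is left out.
encoder = OneHotEncoder(handle_unknown="ignore")
train_x = encoder.fit_transform(data[["A", "B"]])  # SciPy sparse matrix by default

clf_rf = RandomForestClassifier(n_estimators=50, random_state=0)
clf_rf.fit(train_x, train_y)  # every feature is numeric now, so no ValueError

print(train_x.shape)  # (2, 4): two distinct values in A plus two in B
```

The encoder returns a sparse matrix by default, which softens the memory growth somewhat when there are many distinct strings, since only the non-zero entries are stored.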
