how to make RandomForestClassifier faster?


Question

I am trying to implement the bag-of-words model from the kaggle site with Twitter sentiment data that has around 1M rows. I have already cleaned it, but in the last part, when I fit my feature vectors and sentiments with the Random Forest classifier, it takes a very long time. Here is my code...

from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100, verbose=3)
forest = forest.fit(train_data_features, train["Sentiment"])

train_data_features is a 1048575x5000 sparse matrix. I tried to convert it into an array, but doing so raised a memory error.

Where am I going wrong? Can someone suggest a source or another way to do this faster? I am an absolute novice in machine learning and don't have much programming background, so any guidance would help.

Thank you very much.

Answer

Actually the solution is pretty straightforward: get a powerful machine and run the training in parallel. By default RandomForestClassifier uses a single thread, but since a random forest is an ensemble of completely independent models, you can train each of these 100 trees in parallel. Just set

forest = RandomForestClassifier(n_estimators=100, verbose=3, n_jobs=-1)

to use all of your cores. You can also limit max_depth, which will speed things up (you will probably need this either way, since a random forest can overfit badly without any limit on tree depth).
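Putting these settings together, here is a minimal runnable sketch. It uses small synthetic data in place of the real 1048575x5000 matrix (the sizes below are illustrative only), and it passes the scipy sparse matrix directly to fit() — RandomForestClassifier accepts sparse input, so there is no need for the dense conversion that caused the memory error:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix and labels
# (much smaller than 1048575x5000, for illustration only).
rng = np.random.RandomState(0)
X = sparse_random(1000, 500, density=0.01, format="csr", random_state=rng)
y = rng.randint(0, 2, size=1000)

forest = RandomForestClassifier(
    n_estimators=100,  # same as in the question
    max_depth=20,      # cap tree depth: faster and curbs overfitting
    n_jobs=-1,         # train trees in parallel on all cores
)
forest.fit(X, y)       # CSR sparse input works directly, no .toarray()
print(len(forest.estimators_))  # → 100 trees trained
```

On a multi-core machine the wall-clock speedup from n_jobs=-1 is roughly proportional to the number of cores, since each tree is trained independently.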

