How to use dummy variables to represent categorical data in a Python scikit-learn random forest

Question

I'm generating feature vectors for scikit-learn's random forest classifier. Each feature vector represents the names of 9 protein amino acid residues. There are 20 possible residue names, so I use 20 dummy variables to represent one residue name; for 9 residues, that gives 180 dummy variables.

For example, if the 9 residues in the sliding window are ARNDCQEGH (each letter is the one-letter name of a protein residue), my feature vector will be:

"True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	
False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	
False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	
False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	
False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	
False	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	
False	False	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	
False	False	False	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	
False	False	False	False	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False
" 

I also tried using (1, 0) in place of (True, False).

After training and testing scikit-learn's random forest classifier, I found that it did not work at all on this data, although the same classifier works with my other, numerical data.

Can scikit-learn's random forest deal with categorical variables or dummy variables? If so, could you provide an example showing how it works?

Here is how I set up the random forest:

clf = RandomForestClassifier(n_estimators=800, criterion='gini', n_jobs=12, max_depth=None, compute_importances=True, max_features='auto', min_samples_split=1, random_state=None)

Thanks a lot!

Recommended answer

Using boolean features encoded as 0 and 1 should work. If the predictive accuracy is bad even with a large number of decision trees in your forest, it might be that your data is too noisy for the learning algorithm to pick up anything interesting.
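
As a rough illustration of the 0/1 encoding described in the question, the sketch below turns each 9-residue window into 180 dummy variables and fits a random forest. The data is synthetic and the labelling rule (central residue is one of a few hydrophobic residues) is purely an assumption made so the example has a learnable signal. The helper name one_hot_window is hypothetical, and default hyperparameters are used instead of the question's settings, some of which (compute_importances, max_features='auto', min_samples_split=1) are not accepted by recent scikit-learn releases.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# The 20 standard residues as one-letter codes, in the order used in the question.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_window(window):
    """Encode a residue window as len(window) * 20 dummy variables (0/1)."""
    vec = np.zeros((len(window), len(AMINO_ACIDS)), dtype=np.int8)
    vec[np.arange(len(window)), [INDEX[aa] for aa in window]] = 1
    return vec.ravel()


# Synthetic windows (assumption): residues drawn uniformly at random.
rng = np.random.default_rng(0)
windows = ["".join(rng.choice(list(AMINO_ACIDS), size=9)) for _ in range(2000)]

X = np.array([one_hot_window(w) for w in windows])
# Synthetic label (assumption): 1 if the central residue is hydrophobic.
y = np.array([int(w[4] in "AILMFVW") for w in windows])

# Train on the first 1500 windows, evaluate on the remaining 500.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:1500], y[:1500])
accuracy = clf.score(X[1500:], y[1500:])
print(X.shape, accuracy)
```

If a model like this scores well on synthetic data but not on the real windows, the encoding itself is probably not the problem, which is consistent with the suggestion that the real data may simply be noisy.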

Have you tried fitting a linear model (e.g. logistic regression) as a baseline on this data?
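
A linear baseline of this kind might look like the following sketch. It builds the same 9 x 20 dummy-variable layout directly from integer residue codes and scores a LogisticRegression with 5-fold cross-validation. The data and the labelling rule are synthetic assumptions for illustration only; with real data, X and y would come from the actual windows and labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_positions, n_residues = 1000, 9, 20

# Synthetic residue codes (assumption): one integer in [0, 20) per position.
codes = rng.integers(0, n_residues, size=(n_samples, n_positions))

# Expand the codes into 9 * 20 = 180 dummy variables, one block of 20 per position.
X = np.zeros((n_samples, n_positions * n_residues), dtype=np.int8)
rows = np.repeat(np.arange(n_samples), n_positions)
cols = (np.arange(n_positions) * n_residues + codes).ravel()
X[rows, cols] = 1

# Synthetic label (assumption): depends only on the central residue.
y = (codes[:, 4] < 7).astype(int)

# Cross-validated accuracy of the linear baseline.
baseline = LogisticRegression(max_iter=1000)
scores = cross_val_score(baseline, X, y, cv=5)
print(scores.mean())
```

Comparing this number with the random forest's score on the same folds gives a quick sense of whether the features carry any usable signal at all.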

Edit: in practice, using integer coding for categorical variables tends to work very well for many randomized decision tree models (such as RandomForest and ExtraTrees in scikit-learn).
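
Integer coding can be sketched as follows: each residue is replaced by a single integer, so a 9-residue window becomes 9 features instead of 180. The mapping order, the helper name integer_code, the synthetic data, and the labelling rule are all assumptions chosen for illustration. The trees can still isolate individual residues by applying several thresholds to the same feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Arbitrary but fixed mapping from one-letter residue code to an integer.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def integer_code(window):
    """Encode a residue window as one integer feature per position."""
    return [INDEX[aa] for aa in window]


# Synthetic windows (assumption): residues drawn uniformly at random.
rng = np.random.default_rng(0)
windows = ["".join(rng.choice(list(AMINO_ACIDS), size=9)) for _ in range(3000)]

X = np.array([integer_code(w) for w in windows])
# Synthetic label (assumption): 1 if the central residue is hydrophobic.
y = np.array([int(w[4] in "AILMFVW") for w in windows])

# Train on the first 2000 windows, evaluate on the remaining 1000.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:2000], y[:2000])
accuracy = clf.score(X[2000:], y[2000:])
print(X.shape, accuracy)
```

The integers impose an artificial ordering on the residues. Tree ensembles are usually tolerant of this, but a linear model would not be, so the dummy-variable encoding remains the safer choice for a logistic regression baseline.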
