Passing categorical data to Sklearn Decision Tree


Question

There are several posts about how to encode categorical data for Sklearn decision trees, but the Sklearn documentation says the following:


Some advantages of decision trees are:

(...)

Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See the algorithms for more information.

But running the following script

import pandas as pd 
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']

tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])

outputs the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b

I know that in R it is possible to pass categorical data. Is it possible with Sklearn?

Recommended answer

Contrary to the accepted answer, I would prefer to use the tools provided by Scikit-Learn for this purpose. The main reason is that they can be easily integrated into a Pipeline.

Scikit-Learn itself provides very good classes for handling categorical data. Instead of writing a custom function, you should use LabelEncoder, which is specially designed for this purpose.

See the following code from the documentation:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"]) 

This automatically encodes the strings as integers that your machine learning algorithm can use. It also supports going back from integers to strings, which you can do by calling inverse_transform as follows:

list(le.inverse_transform([2, 2, 1]))

This returns ['tokyo', 'tokyo', 'paris'].
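Applied to the data from the question, the same idea might look like the sketch below. It fits one LabelEncoder per string column and then trains the tree on the integer-coded frame. The `encoders` dictionary and the `new_row` frame are names introduced here for illustration only. Two caveats: the scikit-learn documentation describes LabelEncoder as intended for target labels (newer versions offer OrdinalEncoder for feature columns), and the tree will treat the integer codes as ordered numbers rather than as true categories.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Same toy data as in the question.
data = pd.DataFrame({
    "A": ["a", "a", "b", "a"],
    "B": ["b", "b", "a", "b"],
    "C": [0, 0, 1, 0],
    "Class": ["n", "n", "y", "n"],
})

# Keep one fitted encoder per string column so new data can be
# encoded with the same string-to-integer mapping later.
encoders = {}
X = data[["A", "B", "C"]].copy()
for col in ["A", "B"]:
    encoders[col] = LabelEncoder()
    X[col] = encoders[col].fit_transform(X[col])

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, data["Class"])  # string class labels are accepted as-is

# Encode a new (hypothetical) row with the stored encoders before predicting.
new_row = pd.DataFrame({"A": ["b"], "B": ["a"], "C": [1]})
for col in ["A", "B"]:
    new_row[col] = encoders[col].transform(new_row[col])
prediction = tree.predict(new_row)
```

Only the feature columns need encoding here; the ValueError in the question comes from the string values in X, not from the class labels.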

Also note that for many other classifiers besides decision trees, such as logistic regression or SVM, you would want to encode your categorical variables using one-hot encoding. Scikit-learn supports this as well through the OneHotEncoder class.
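A minimal sketch of that one-hot approach inside a Pipeline, assuming a scikit-learn version that includes ColumnTransformer (0.20 or later). The step names "onehot", "preprocess", and "clf" are arbitrary labels chosen for this example, and the data is the toy frame from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({
    "A": ["a", "a", "b", "a"],
    "B": ["b", "b", "a", "b"],
    "C": [0, 0, 1, 0],
    "Class": ["n", "n", "y", "n"],
})
X = data[["A", "B", "C"]]
y = data["Class"]

# One-hot encode the string columns; pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["A", "B"])],
    remainder="passthrough",
)

# Encoding and model live in one estimator, so fit/predict take the raw frame.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression()),
])
model.fit(X, y)
predictions = model.predict(X)

# Two categories in A + two in B + the passthrough column C = 5 features.
encoded = model.named_steps["preprocess"].transform(X)
```

Because the encoder is fitted as part of the pipeline, the same category-to-column mapping is reused automatically at prediction time.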

Hope this helps!
