Passing categorical data to Sklearn Decision Tree


Question

There are several posts about how to encode categorical data for Sklearn decision trees, but the Sklearn documentation says the following:

Some advantages of decision trees are:

(...)

Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See the algorithms for more information.

But running the following script

import pandas as pd 
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']

tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])

outputs the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b

I know that in R it is possible to pass categorical data directly. Is this possible with Sklearn?

Recommended answer

Contrary to the accepted answer, I would prefer to use the tools provided by Scikit-Learn for this purpose. The main reason is that they can be easily integrated into a Pipeline.

Scikit-Learn itself provides very good classes for handling categorical data. Instead of writing a custom function, you should use LabelEncoder, which is designed for turning labels into integers.

Refer to the following code from the documentation:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"]) 

This automatically encodes the strings as numbers for your machine learning algorithms. It also supports going back from integers to strings, which you can do by calling inverse_transform as follows:

list(le.inverse_transform([2, 2, 1]))

This returns ['tokyo', 'tokyo', 'paris'].
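Applied to the DataFrame from the question, the same idea might look like the sketch below. One caveat: LabelEncoder works on a single 1-D array and the scikit-learn documentation describes it as meant for target labels, so this sketch swaps in OrdinalEncoder, its counterpart for feature columns, together with ColumnTransformer (both assume scikit-learn 0.20 or later). The step names "cat", "encode", and "tree" are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Same toy data as in the question.
data = pd.DataFrame({
    "A": ["a", "a", "b", "a"],
    "B": ["b", "b", "a", "b"],
    "C": [0, 0, 1, 0],
    "Class": ["n", "n", "y", "n"],
})
X = data[["A", "B", "C"]]
y = data["Class"]

# Map the string columns to integer codes; pass the numeric column through.
encode = ColumnTransformer(
    [("cat", OrdinalEncoder(), ["A", "B"])],
    remainder="passthrough",
)

# Encoding and the tree live in one estimator, so fit/predict take raw strings.
model = Pipeline([
    ("encode", encode),
    ("tree", DecisionTreeClassifier(random_state=0)),
])
model.fit(X, y)

predictions = model.predict(X)
print(list(predictions))
```

Because the encoder is part of the Pipeline, the same mapping learned during fit is reused at predict time. Keep in mind that integer codes impose an arbitrary order on the categories, which a tree can usually work around by splitting more than once.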

Also note that for many other classifiers besides decision trees, such as logistic regression or SVM, you will want to encode your categorical variables using one-hot encoding. Scikit-learn supports this as well through the OneHotEncoder class.
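A minimal sketch of the one-hot variant, again on the question's toy data. The use of ColumnTransformer, the handle_unknown="ignore" setting, and the step names are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({
    "A": ["a", "a", "b", "a"],
    "B": ["b", "b", "a", "b"],
    "C": [0, 0, 1, 0],
    "Class": ["n", "n", "y", "n"],
})
X = data[["A", "B", "C"]]
y = data["Class"]

# One indicator column per category; unseen categories at predict time
# are encoded as all zeros instead of raising an error.
one_hot = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["A", "B"])],
    remainder="passthrough",
)

# Inspect the encoded matrix: 2 columns for A, 2 for B, plus C passed through.
encoded = one_hot.fit_transform(X)
print(encoded.shape)

# The same transformer can feed a linear model inside a Pipeline.
model = Pipeline([
    ("one_hot", one_hot),
    ("clf", LogisticRegression()),
])
model.fit(X, y)
predictions = model.predict(X)
```

Unlike integer codes, the indicator columns do not suggest any ordering between categories, which is why this encoding tends to suit linear models and SVMs.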

Hope this helps!
