Passing categorical data to Sklearn Decision Tree
Problem description
There are several posts about how to encode categorical data for Sklearn decision trees, but the Sklearn documentation says this:
Some advantages of decision trees are:
(...)
Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See the algorithms for more information.
But running the following script
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']
tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])
outputs the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
X = check_array(X, dtype=DTYPE, accept_sparse="csc")
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b
I know that in R it is possible to pass categorical data directly. Is this possible with Sklearn?
Recommended answer
Contrary to the accepted answer, I would prefer to use tools provided by Scikit-Learn for this purpose. The main reason for doing so is that they can be easily integrated in a Pipeline.
Scikit-Learn itself provides very good classes for handling categorical data. Instead of writing a custom function, you should use LabelEncoder, which is designed for this purpose.
Refer to the following code from the documentation:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"])
This automatically encodes the strings as numbers for your machine learning algorithms. It also supports going back from integers to strings. You can do this by simply calling inverse_transform as follows:
list(le.inverse_transform([2, 2, 1]))
This returns ['tokyo', 'tokyo', 'paris'].
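Applied to the data from the question, the idea might look like the sketch below. It assumes one LabelEncoder per string column, stored in a dictionary named `encoders` (a name chosen here for illustration) so the mapping can be reversed later. Note that the tree treats the resulting integer codes as ordinary numbers, and that the Scikit-Learn documentation describes LabelEncoder as intended for target values, with OrdinalEncoder as the counterpart for feature columns, so treat this as an illustration rather than the definitive approach.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Same toy data as in the question.
data = pd.DataFrame({
    "A": ["a", "a", "b", "a"],
    "B": ["b", "b", "a", "b"],
    "C": [0, 0, 1, 0],
    "Class": ["n", "n", "y", "n"],
})

# Encode each string column with its own encoder and keep the encoders
# around so inverse_transform can recover the original strings.
X = data[["A", "B", "C"]].copy()
encoders = {}
for col in ["A", "B"]:
    encoders[col] = LabelEncoder()
    X[col] = encoders[col].fit_transform(X[col])

# The features are now numeric, so fit() no longer raises ValueError.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, data["Class"])
predictions = tree.predict(X)
```

The target column can stay as strings; only the feature matrix has to be numeric.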
Also note that for many classifiers other than decision trees, such as logistic regression or SVM, you will want to encode your categorical variables using one-hot encoding. Scikit-learn supports this as well through the OneHotEncoder class.
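As a rough sketch of how one-hot encoding can be combined with a Pipeline, the example below reuses the question's toy data. The choice of ColumnTransformer to select the string columns, the step names `preprocess` and `clf`, and the use of LogisticRegression are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Same toy data as in the question.
data = pd.DataFrame({
    "A": ["a", "a", "b", "a"],
    "B": ["b", "b", "a", "b"],
    "C": [0, 0, 1, 0],
    "Class": ["n", "n", "y", "n"],
})
features = data[["A", "B", "C"]]

# One-hot encode the string columns and pass the numeric column through.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["A", "B"])],
    remainder="passthrough",
)

# Encoding and model live in one estimator, so the same encoding is
# applied at fit time and at predict time.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression()),
])
model.fit(features, data["Class"])

encoded = model.named_steps["preprocess"].transform(features)
predictions = model.predict(features)
```

Here `encoded` has five columns: two indicator columns for A, two for B, and the untouched numeric column C.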
Hope this helps!