Passing categorical data to Sklearn Decision Tree
Question
There are several posts about how to encode categorical data for Sklearn decision trees, but the Sklearn documentation says the following:
Some advantages of decision trees are:
(...)
Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See the algorithms for more information.
But running the following script
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']
tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])
outputs the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b
I know that in R it is possible to pass categorical data directly. Is that possible with Sklearn?
Recommended answer
Contrary to the accepted answer, I would prefer to use the tools provided by Scikit-Learn for this purpose. The main reason is that they can be easily integrated into a Pipeline.
Scikit-Learn itself provides very good classes for handling categorical data. Instead of writing a custom function, you should use LabelEncoder, which is designed for mapping labels to integers.
See the following code from the documentation:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"])
This automatically encodes them as integers for your machine learning algorithm. It also supports mapping integers back to strings; simply call inverse_transform as follows:
list(le.inverse_transform([2, 2, 1]))
This returns ['tokyo', 'tokyo', 'paris'].
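Applied to the data from the question, the same idea might look like the sketch below. It assumes one LabelEncoder per string column, stored in a dictionary named `encoders` (a name introduced here for illustration) so that each column can be decoded again later. Note that the scikit-learn documentation describes LabelEncoder as intended for target values; for feature columns, OrdinalEncoder performs the same kind of integer mapping on several columns at once and may be the better fit.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Same toy data as in the question.
data = pd.DataFrame({
    "A": ["a", "a", "b", "a"],
    "B": ["b", "b", "a", "b"],
    "C": [0, 0, 1, 0],
    "Class": ["n", "n", "y", "n"],
})

# One encoder per string column, kept so the mapping can be inverted later.
encoders = {}
X = data[["A", "B", "C"]].copy()
for col in ["A", "B"]:
    encoders[col] = LabelEncoder()
    X[col] = encoders[col].fit_transform(X[col])

# The features are now numeric; string class labels are accepted for y.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, data["Class"])

predictions = tree.predict(X)

# Recover the original strings of column A from the integer codes.
decoded_A = encoders["A"].inverse_transform(X["A"])
```

Keep in mind that the tree treats these integer codes as ordered numbers, so the splits it learns depend on the arbitrary (alphabetical) order of the codes.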
Also note that for many other classifiers besides decision trees, such as logistic regression or SVM, you would want to encode your categorical variables using one-hot encoding. Scikit-learn supports this as well through the OneHotEncoder class.
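As a minimal sketch of how one-hot encoding fits into a Pipeline, the example below reuses the question's data with a logistic regression. The use of ColumnTransformer, the step names ("onehot", "preprocess", "classifier"), and the handle_unknown="ignore" setting are choices made for this illustration, not part of the original answer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Same toy data as in the question.
data = pd.DataFrame({
    "A": ["a", "a", "b", "a"],
    "B": ["b", "b", "a", "b"],
    "C": [0, 0, 1, 0],
    "Class": ["n", "n", "y", "n"],
})
features = data[["A", "B", "C"]]

# One-hot encode the string columns; pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["A", "B"])],
    remainder="passthrough",
)

# Encoding and classifier are fitted together as a single estimator.
model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])
model.fit(features, data["Class"])

predictions = model.predict(features)

# Inspect the encoded matrix: 2 columns for A, 2 for B, plus C.
encoded = model.named_steps["preprocess"].transform(features)
```

Because the encoder lives inside the pipeline, calling model.predict on new raw data applies the same encoding that was learned during fit.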
Hope this helps!