Map predictions back to IDs - Python Scikit Learn DecisionTreeClassifier


Question

I have a dataset with a unique identifier and several other features. It looks like this:


ID      LenA TypeA LenB TypeB Diff Score Response
123-456  51   M     101  L     50   0.2   0
234-567  46   S     49   S     3    0.9   1
345-678  87   M     70   M     17   0.7   0


I split it into training and test data. I am trying to classify the test data into two classes using a classifier trained on the training data. I want to keep the identifier in both the training and test datasets so I can map the predictions back to the IDs.
Is there a way to mark the identifier column as an ID or non-predictor, like we can in Azure ML Studio or SAS?

I am using the DecisionTreeClassifier from Scikit-Learn. This is the code I have for the classifier:

from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(traindata, trainlabels)

If I just include the ID in traindata, the code throws an error:


ValueError: invalid literal for float(): 123-456


Recommended answer

Not knowing how you made your split, I would suggest simply making sure the ID column is not included in your training data. Something like this, perhaps:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ~df.columns.isin(['ID', 'Response'])].values, df.Response)

That uses every DataFrame column except ID and Response as the X values, and uses Response as the y values.

But you still will not be able to use DecisionTreeClassifier with this data, because it contains strings. You need to convert every column with categorical data, i.e. TypeA and TypeB, to a numerical representation. In my opinion the best way to do this for sklearn is with LabelEncoder. It converts categorical string labels such as ['M', 'S'] into the integer codes [0, 1], which DecisionTreeClassifier can work with. If you need an example, take a look at Passing categorical data to sklearn decision tree.
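As a rough illustration of the encoding step described above, the sketch below applies LabelEncoder to each categorical column of a small frame rebuilt from the sample rows in the question. The `encoders` dictionary is an assumption added here so the codes can be decoded later; it is not part of the original answer. Note that LabelEncoder numbers classes from 0 in sorted order, and that each column gets its own mapping.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame mirroring the three sample rows from the question.
df = pd.DataFrame({
    "ID": ["123-456", "234-567", "345-678"],
    "LenA": [51, 46, 87],
    "TypeA": ["M", "S", "M"],
    "LenB": [101, 49, 70],
    "TypeB": ["L", "S", "M"],
    "Diff": [50, 3, 17],
    "Score": [0.2, 0.9, 0.7],
    "Response": [0, 1, 0],
})

# Fit one encoder per categorical column and keep it, so the integer
# codes can be mapped back to the original strings if needed.
encoders = {}
for col in ["TypeA", "TypeB"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df[["TypeA", "TypeB"]])
```

Because the encoders are fitted per column, the same string can receive different codes in different columns (here 'M' is 0 in TypeA but 1 in TypeB). `encoders[col].inverse_transform(...)` recovers the original labels.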

Update

Per your comment, I now understand that you need to map back to the ID. In this case you can use pandas to your advantage. Set ID as the index of your data and then do the split; that way the ID value is retained for every row of your train and test data. Let's assume your data are already in a pandas DataFrame.

df = df.set_index('ID')
X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ~df.columns.isin(['Response'])], df.Response)
print(X_train)
         LenA TypeA  LenB TypeB  Diff  Score
ID
345-678    87     M    70     M    17    0.7
234-567    46     S    49     S     3    0.9
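To make the last step concrete, here is a minimal end-to-end sketch of mapping predictions back to IDs once ID is the index. The data, the `id-000` style identifiers, and the names `predictions` and `results` are made up for illustration. It relies on `predict` returning one value per row in the same order as `X_test`, so the test index can be reattached directly.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up numeric data; the IDs and values are illustrative only.
df = pd.DataFrame({
    "ID": ["id-%03d" % i for i in range(20)],
    "LenA": [50 + i for i in range(20)],
    "Diff": [(i * 7) % 13 for i in range(20)],
    "Response": [i % 2 for i in range(20)],
}).set_index("ID")

X = df.drop(columns="Response")
y = df["Response"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# predict() returns a plain array in the same row order as X_test,
# so reattaching X_test.index labels each prediction with its ID.
predictions = pd.Series(clf.predict(X_test), index=X_test.index, name="Predicted")
results = pd.concat([y_test.rename("Actual"), predictions], axis=1)
print(results)
```

The ID never reaches the classifier as a feature because it lives in the index, yet every row of `results` is still keyed by it.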
