Getting ValueError: y contains new labels when using scikit-learn's LabelEncoder
Question
I have a series like:
df['ID'] = ['ABC123', 'IDF345', ...]
I'm using scikit-learn's LabelEncoder to convert it to numerical values to be fed into the RandomForestClassifier.
During training, I do the following:
le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)
But now, for testing/prediction, when I pass in new data, I want to transform the 'ID' column of this data based on le_id. That is, if the same values are present, transform them according to the label encoder above; otherwise, assign a new numerical value.
In the test file, I was doing the following:
new_df['ID'] = le_id.transform(new_df.ID)
But I'm getting the following error: ValueError: y contains new labels
How do I fix this? Thanks!
UPDATE:
So the task I have is to use the data below (for example) as training data and predict the 'High', 'Mod', 'Low' values for new BankNum, ID combinations. The model should learn from the training dataset the characteristics under which a 'High' is given and under which a 'Low' is given. For example, below, a 'High' is given when there are multiple entries with the same BankNum and different IDs.
df =
BankNum | ID | Labels
0098-7772 | AB123 | High
0098-7772 | ED245 | High
0098-7772 | ED343 | High
0870-7771 | ED200 | Mod
0870-7771 | ED100 | Mod
0098-2123 | GH564 | Low
And then predict on something like:
BankNum | ID |
00982222 | AB999 |
00982222 | AB999 |
00981111 | AB890 |
I'm doing something like this:
df['BankNum'] = df.BankNum.astype(np.float128)
le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)
X_train, X_test, y_train, y_test = train_test_split(df[['BankNum', 'ID']], df.Labels, test_size=0.25, random_state=42)
clf = RandomForestClassifier(random_state=42, n_estimators=140)
clf.fit(X_train, y_train)
Recommended answer
I think the error message is very clear: your test dataset contains ID labels that were not included in your training dataset. For these items, the LabelEncoder cannot find a suitable numeric value to represent them. There are a few ways to solve this problem. You can try to balance your dataset so that each label is present not only in your test data but also in your training data. Otherwise, you can try to follow one of the ideas presented here.
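To see which values trigger the error, one option is to compare the test IDs against the encoder's fitted classes_ attribute. The sketch below uses made-up ID values (the train_ids and test_ids lists are assumptions for illustration, not data from the question):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only; not taken from the original dataset.
train_ids = ["ABC123", "IDF345", "ABC123"]
test_ids = ["ABC123", "XYZ999"]

le_id = LabelEncoder()
le_id.fit(train_ids)

# IDs that appear in the test data but were never seen by fit()
unseen = sorted(set(test_ids) - set(le_id.classes_))
print("unseen IDs:", unseen)

# transform() raises ValueError as soon as it meets one of them
try:
    le_id.transform(test_ids)
except ValueError as err:
    print("transform failed:", err)
```

The exact wording of the exception message differs between scikit-learn versions, so it is safer to catch ValueError than to match on the text.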
One possible solution is to go through your whole dataset at the beginning, get a list of all unique ID values, fit the LabelEncoder on this list, and keep the rest of your code just as it is at the moment.
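A minimal sketch of this first approach, assuming both the training and the prediction data are available before the encoder is fitted (the small df and new_df frames here are stand-ins built from IDs similar to those in the question):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in frames; in practice these would be the real training and new data.
df = pd.DataFrame({"ID": ["AB123", "ED245", "ED343"]})
new_df = pd.DataFrame({"ID": ["AB999", "AB123"]})

# Collect every unique ID from both frames and fit the encoder once on all of them.
all_ids = pd.concat([df["ID"], new_df["ID"]]).unique()
le_id = LabelEncoder()
le_id.fit(all_ids)

# Both frames can now be transformed without hitting unseen labels.
df["ID"] = le_id.transform(df["ID"])
new_df["ID"] = le_id.transform(new_df["ID"])
```

This only removes the exception. An ID that occurs solely in the new data still receives a code the classifier never saw during training, so the model has learned nothing specific about it.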
Another possible solution is to check that the test data contains only labels that were seen during training. If there is a new label, you have to set it to some fallback value like unknown_id (or something similar). By doing this, you put all new, unknown IDs in one class; for these items the prediction will then fail, but you can use the rest of your code as it is now.
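The fallback idea could look roughly like the sketch below. The placeholder string "unknown_id" follows the answer's suggestion; including it in the data passed to fit() is an assumption made here so that the encoder has a code reserved for it. The ID values are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "unknown_id"  # placeholder class for IDs not seen during training

# Illustrative values only.
train_ids = pd.Series(["AB123", "ED245", "ED343"])
test_ids = pd.Series(["AB123", "AB999"])

# Fit on the training IDs plus the placeholder so it gets its own code.
le_id = LabelEncoder()
le_id.fit(list(train_ids) + [UNKNOWN])

# Keep IDs the encoder knows; replace everything else with the placeholder.
test_clean = test_ids.where(test_ids.isin(le_id.classes_), UNKNOWN)
encoded = le_id.transform(test_clean)
```

Since no training row carries the placeholder code, the classifier has no information about it, which matches the answer's warning that predictions for these items should not be trusted.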