Tell LabelEncoder to ignore new labels?


Problem description

I'm working with text data where a lot of user error has to be accounted for. For example, when predicting on new data, labels often appear that the encoder has not seen before, because of typos and the like. I just want to ignore these: when I run labelencoder.transform(df_newdata['GL_Description']), it should skip anything it has not seen before. How can I do this? I didn't find a parameter for this in the docs. Is the only way really to check every word one by one "by hand" and drop them, or can I tell the encoder to ignore any new labels that are not in its dictionary?

Recommended answer

For that, you can subclass the original LabelEncoder and override its transform and inverse_transform methods. Something like this:

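import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import column_or_1d
from sklearn.utils.validation import check_is_fitted
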
class TolerantLabelEncoder(LabelEncoder):
    def __init__(self, ignore_unknown=False,
                       unknown_original_value='unknown', 
                       unknown_encoded_value=-1):
        self.ignore_unknown = ignore_unknown
        self.unknown_original_value = unknown_original_value
        self.unknown_encoded_value = unknown_encoded_value

    def transform(self, y):
        check_is_fitted(self, 'classes_')
        y = column_or_1d(y, warn=True)

        indices = np.isin(y, self.classes_)
        if not self.ignore_unknown and not np.all(indices):
            raise ValueError("y contains new labels: %s" 
                                         % str(np.setdiff1d(y, self.classes_)))

        y_transformed = np.searchsorted(self.classes_, y)
        y_transformed[~indices]=self.unknown_encoded_value
        return y_transformed

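    def inverse_transform(self, y):
        check_is_fitted(self, 'classes_')
        y = np.asarray(y)

        # Valid codes are the positions 0..n_classes-1 in classes_.
        labels = np.arange(len(self.classes_))
        indices = np.isin(y, labels)
        if not self.ignore_unknown and not np.all(indices):
            raise ValueError("y contains new labels: %s" 
                                         % str(np.setdiff1d(y, labels)))

        # Look up unknown codes at a placeholder position (0) so that
        # out-of-range values do not raise, then overwrite those entries.
        safe_codes = np.where(indices, y, 0).astype(int)
        y_transformed = np.asarray(self.classes_[safe_codes], dtype=object)
        y_transformed[~indices]=self.unknown_original_value
        return y_transformed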

Example usage:

en = TolerantLabelEncoder(ignore_unknown=True)
en.fit(['a','b'])

print(en.transform(['a', 'c', 'b']))
# Output: [ 0 -1  1]

print(en.inverse_transform([-1, 0, 1]))
# Output: ['unknown' 'a' 'b']
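
If a subclass is more than you need, the same idea can be sketched without scikit-learn at all. The snippet below is only an illustration: the helper names encode_known and drop_unknown and the unknown_code parameter are made up for this example and are not part of any library. It relies on the fact that LabelEncoder assigns codes by position in the sorted list of unique training labels, and it shows both options from the question: mapping unseen labels to a sentinel, or dropping them before encoding.

```python
def encode_known(labels, classes, unknown_code=-1):
    """Encode labels by their position in the sorted unique classes.

    Labels that were not among the training classes get unknown_code
    instead of raising an error. (Hypothetical helper for illustration.)
    """
    index = {label: code for code, label in enumerate(sorted(set(classes)))}
    return [index.get(label, unknown_code) for label in labels]


def drop_unknown(labels, classes):
    """Return only the labels that were seen during training."""
    known = set(classes)
    return [label for label in labels if label in known]


if __name__ == "__main__":
    train_classes = ['a', 'b']
    new_labels = ['a', 'c', 'b']
    print(encode_known(new_labels, train_classes))  # [0, -1, 1]
    print(drop_unknown(new_labels, train_classes))  # ['a', 'b']
```

Dropping changes the length of the data, so if the labels are a DataFrame column you would filter the rows of the whole frame with the same membership test rather than the column alone.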
