scikit-learn classification on soft labels


Problem description


According to the documentation, it is possible to specify different loss functions for SGDClassifier. And as far as I understand, log loss is a cross-entropy loss function which can theoretically handle soft labels, i.e. labels given as probabilities in [0, 1].

The question is: is it possible to use SGDClassifier with the log loss function out of the box for classification problems with soft labels? And if not, how can this task (linear classification on soft labels) be solved using scikit-learn?

UPDATE:

Because of the way the target is labeled and the nature of the problem, hard labels don't give good results. But it is still a classification problem (not regression), and I want to keep the probabilistic interpretation of the prediction, so regression doesn't work out of the box either. A cross-entropy loss function can handle soft labels in the target naturally. It seems that all loss functions for linear classifiers in scikit-learn can only handle hard labels.
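To illustrate why cross-entropy copes with soft labels: for any target t in [0, 1], the loss -(t*log(p) + (1-t)*log(1-p)) is minimized exactly at p = t, not only at the endpoints 0 and 1. A quick NumPy check (the function name is mine):

```python
import numpy as np

def binary_cross_entropy(p, t):
    """Cross-entropy of prediction p against target t; t may be a soft label."""
    eps = 1e-12                      # guard against log(0)
    p = np.clip(p, eps, 1 - eps)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

# For a soft target t = 0.7, the loss is smallest at p = 0.7:
print(binary_cross_entropy(0.7, 0.7))  # ~0.611 (the minimum over p)
print(binary_cross_entropy(0.9, 0.7))  # ~0.765 (worse)
```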

So the question is probably:

How can I specify my own loss function for SGDClassifier, for example? It seems scikit-learn doesn't stick to a modular approach here, and changes would need to be made somewhere inside its sources.
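For concreteness, this is the kind of thing I would otherwise have to write by hand: the soft-label cross-entropy gradient for a linear model is just X.T @ (sigmoid(X @ w) - y), so the fit loop is short. A plain-NumPy sketch (the function names are mine, not a scikit-learn API):

```python
import numpy as np

def fit_soft_logistic(X, y_soft, lr=0.1, n_iter=5000):
    """Batch gradient descent for logistic regression where y_soft
    may hold probabilities in [0, 1] rather than only 0/1 labels."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))           # predicted probabilities
        grad = Xb.T @ (p - y_soft) / len(y_soft)    # cross-entropy gradient
        w -= lr * grad
    return w

def predict_soft(X, w):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))            # soft predictions in (0, 1)
```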

Solution

According to the docs,

The ‘log’ loss gives logistic regression, a probabilistic classifier.

In general, a loss function has the form Loss(prediction, target), where prediction is the model's output and target is the ground-truth value. In the case of logistic regression, prediction is a value in (0, 1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").

So in answer to your question, it depends on whether you are referring to the prediction or the target. Generally speaking, the form of the labels ("hard" or "soft") is given by the algorithm chosen for the prediction and by the data on hand for the target.

If your data has "hard" labels, and you desire a "soft" label output by your model (which can be thresholded to give a "hard" label), then yes, logistic regression is in this category.
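A minimal sketch of this first case (the toy data is mine; note the loss is spelled "log" in older scikit-learn releases and "log_loss" in newer ones):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])                       # "hard" labels

clf = SGDClassifier(loss="log_loss", max_iter=1000, tol=1e-3)
clf.fit(X, y)

soft = clf.predict_proba(X)[:, 1]                # "soft" outputs in (0, 1)
hard = (soft >= 0.5).astype(int)                 # threshold back to "hard"
```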

If your data has "soft" labels, then you would have to choose a threshold to convert them to "hard" labels before using typical classification methods (e.g., logistic regression). Otherwise, you could use a regression method in which the model is fit to predict the "soft" target. In this latter approach, your model could give values outside of (0, 1), and this would have to be handled.
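A sketch of this regression route (again with toy data of my own; clipping is one simple way to handle out-of-range predictions, and any regressor would do in place of SGDRegressor):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y_soft = np.array([0.1, 0.3, 0.8, 0.9])          # "soft" targets in [0, 1]

reg = SGDRegressor(max_iter=1000, tol=1e-3)
reg.fit(X, y_soft)

pred = np.clip(reg.predict(X), 0.0, 1.0)         # clamp values outside (0, 1)
```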
