scikit-learn classification on soft labels
Question
According to the documentation it is possible to specify different loss functions for SGDClassifier. And as far as I understand, log loss is a cross-entropy loss function which can theoretically handle soft labels, i.e. labels given as probabilities in [0, 1].
The question is: is it possible to use SGDClassifier with log loss out of the box for classification problems with soft labels? And if not, how can this task (linear classification on soft labels) be solved using scikit-learn?
UPDATE:
The way target
is labeled and by the nature of the problem hard labels don't give good results. But it is still a classification problem (not regression) and I wan't to keep probabilistic interpretation of the prediction
so regression doesn't work out of the box too. Cross-entropy loss function can handle soft labels in target
naturally. It seems that all loss functions for linear classifiers in scikit-learn can only handle hard labels.
So the question is probably:
How to specify my own loss function for SGDClassifier
, for example. It seems scikit-learn
doesn't stick to the modular approach here and changes need to be done somewhere inside it's sources
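One way to get such behaviour out of the box, sketched here with illustrative data: cross-entropy against a soft label p decomposes as p times the log loss on hard label 1 plus (1 - p) times the log loss on hard label 0, so duplicating each sample with both hard labels and passing the soft labels as sample weights reproduces the soft-label objective with a stock SGDClassifier.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Illustrative data: X is the feature matrix, p the soft labels in [0, 1].
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    p = rng.uniform(size=100)

    # Duplicate every sample: once as class 1 weighted by p_i, once as
    # class 0 weighted by 1 - p_i. The weighted log loss on this expanded
    # set equals the cross-entropy against the soft labels.
    X_dup = np.vstack([X, X])
    y_dup = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
    w_dup = np.concatenate([p, 1.0 - p])

    # loss="log_loss" in scikit-learn >= 1.1; older versions call it "log".
    clf = SGDClassifier(loss="log_loss", max_iter=1000)
    clf.fit(X_dup, y_dup, sample_weight=w_dup)

    soft_predictions = clf.predict_proba(X)[:, 1]  # probabilistic output retained

Since p_i + (1 - p_i) = 1 for each sample, the total weight matches the original sample count, so the regularization strength stays comparable to an unweighted fit.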
Answer
According to the documentation, "The 'log' loss gives logistic regression, a probabilistic classifier."
In general a loss function is of the form Loss(prediction, target), where prediction is the model's output and target is the ground-truth value. In the case of logistic regression, prediction is a value on (0, 1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").
So in answer to your question, it depends on whether you are referring to the prediction or the target. Generally speaking, the form of the labels ("hard" or "soft") is given by the algorithm chosen for the prediction and by the data on hand for the target.
If your data has "hard" labels, and you desire a "soft" label output by your model (which can be thresholded to give a "hard" label), then yes, logistic regression is in this category.
If your data has "soft" labels, then you would have to choose a threshold to convert them to "hard" labels before using typical classification methods (i.e., logistic regression). Otherwise, you could use a regression method where the model is fit to predict the "soft" target. In this latter approach, your model could give values outside of (0,1)
, and this would have to be handled.
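A minimal sketch of both routes, with placeholder data and Ridge standing in for any regressor:

    import numpy as np
    from sklearn.linear_model import Ridge, SGDClassifier

    # Placeholder data: soft labels p in [0, 1].
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 4))
    p = rng.uniform(size=200)

    # Option 1: threshold the soft labels into hard labels, then classify.
    y_hard = (p >= 0.5).astype(int)
    clf = SGDClassifier(loss="log_loss", max_iter=1000).fit(X, y_hard)

    # Option 2: fit a regressor to the soft targets. Predictions may fall
    # outside (0, 1); clipping is one simple way to handle that.
    reg = Ridge().fit(X, p)
    p_hat = np.clip(reg.predict(X), 0.0, 1.0)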