Predict classes or class probabilities?


Problem description

I am currently using H2O for a classification problem dataset. I am testing it out with H2ORandomForestEstimator in a Python 3.6 environment. I noticed that the results of the predict method were values between 0 and 1 (I assume these are probabilities).

In my data set, the target attribute is numeric, i.e. True values are 1 and False values are 0. I made sure I converted the type to category for the target attribute, but I was still getting the same result.

Then I modified the code to convert the target column to a factor using the asfactor() method on the H2OFrame; still, there wasn't any change in the result.

But when I changed the values in the target attribute to True and False for 1 and 0 respectively, I got the expected result, i.e. the output was the classification rather than the probability.

  • What is the right way to get the classified prediction result?
  • If probabilities are the outcomes for numerical target values, then how do I handle it in the case of a multiclass classification?

Answer

In principle & in theory, hard & soft classification (i.e. returning classes & probabilities respectively) are different approaches, each one with its own merits & downsides. Consider for example the following, from the paper Hard or Soft Classification? Large-margin Unified Machines:

Margin-based classifiers have been popular in both machine learning and statistics for classification problems. Among numerous classifiers, some are hard classifiers while some are soft ones. Soft classifiers explicitly estimate the class conditional probabilities and then perform classification based on estimated probabilities. In contrast, hard classifiers directly target on the classification decision boundary without producing the probability estimation. These two types of classifiers are based on different philosophies and each has its own merits.

That said, in practice, most of the classifiers used today, including Random Forest (the only exception I can think of is the SVM family) are in fact soft classifiers: what they actually produce underneath is a probability-like measure, which subsequently, combined with an implicit threshold (usually 0.5 by default in the binary case), gives a hard class membership like 0/1 or True/False.

What is the right way to get the classified prediction result?

For starters, it is always possible to go from probabilities to hard classes, but the opposite is not true.
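A toy illustration of this asymmetry (the numbers and the usual binary cut-off of 0.5 are assumptions for illustration): thresholding maps many different probabilities to the same hard class, so the mapping cannot be inverted.

```python
# Probabilities -> hard classes is many-to-one, hence not invertible.
probs = [0.51, 0.70, 0.99]
classes = [int(p >= 0.5) for p in probs]  # default binary threshold of 0.5
print(classes)  # [1, 1, 1] -- three different probabilities, one hard class
```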

Generally speaking, and given the fact that your classifier is in fact a soft one, getting just the end hard classifications (True/False) gives a "black box" flavor to the process, which in principle should be undesirable; handling directly the produced probabilities, and (important!) controlling explicitly the decision threshold should be the preferable way here. According to my experience, these are subtleties that are often lost to new practitioners; consider for example the following, from the Cross Validated thread Reduce Classification probability threshold:
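As a sketch of what "controlling the decision threshold explicitly" might look like in scikit-learn (the synthetic imbalanced data and the 0.30 cut-off are assumptions for illustration, not part of the original answer):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced binary data (roughly 90% negatives), for illustration.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

proba = clf.predict_proba(X)[:, 1]            # P(class 1) for each sample
default_preds = (proba >= 0.5).astype(int)    # the implicit default cut-off
custom_preds = (proba >= 0.3).astype(int)     # lowering it flags more positives

print(custom_preds.sum() >= default_preds.sum())  # True: never fewer positives
```

Lowering the threshold is one common lever when the positive class is rare; the point is that the choice is yours, not the classifier's.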

the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.

Apart from "soft" arguments (pun unintended) like the above, there are cases where you need to handle directly the underlying probabilities and thresholds, i.e. cases where the default threshold of 0.5 in binary classification will lead you astray, most notably when your classes are imbalanced; see my answer in High AUC but bad predictions with imbalanced data (and the links therein) for a concrete example of such a case.

To be honest, I am rather surprised by the behavior of H2O you report (I haven't used it personally), i.e. that the kind of output is affected by the representation of the input; this should not be the case, and if it indeed is, we may have an issue of bad design. Compare for example the Random Forest classifier in scikit-learn, which includes two different methods, predict and predict_proba, to get the hard classifications and the underlying probabilities respectively (and checking the docs, it is apparent that the output of predict is based on the probability estimates, which have already been computed).
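A quick check of that last point, assuming scikit-learn and its bundled iris data: predict() returns exactly the argmax of predict_proba(), confirming that the hard classes are derived from the probability estimates rather than computed independently.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

proba = clf.predict_proba(X)                     # shape (n_samples, 3)
hard = clf.predict(X)                            # hard class labels
derived = clf.classes_[np.argmax(proba, axis=1)] # argmax over probabilities

print(np.array_equal(hard, derived))  # True
```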

If probabilities are the outcomes for numerical target values, then how do I handle it in case of a multiclass classification?

There is nothing new here in principle, apart from the fact that a simple threshold is no longer meaningful; again, from the Random Forest predict docs in scikit-learn:

the predicted class is the one with highest mean probability estimate

That is, for 3 classes (0, 1, 2), you get an estimate of [p0, p1, p2] (with elements summing up to one, as per the rules of probability), and the predicted class is the one with the highest probability, e.g. class #1 for the case of [0.12, 0.60, 0.28]. Here is a reproducible example with the 3-class iris dataset (it's for the GBM algorithm and in R, but the rationale is the same).
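The argmax rule above, written out in NumPy for concreteness (using the same [0.12, 0.60, 0.28] numbers from the text rather than the R/GBM example):

```python
import numpy as np

proba = np.array([0.12, 0.60, 0.28])     # estimates for classes 0, 1, 2; sum to 1
predicted_class = int(np.argmax(proba))  # class with the highest probability
print(predicted_class)  # 1
```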

