Predict classes or class probabilities?

Problem Description

I am currently using H2O for a classification problem dataset, testing it out with H2ORandomForestEstimator in a Python 3.6 environment. I noticed that the results of the predict method were values between 0 and 1 (I am assuming these are probabilities).

In my data set, the target attribute is numeric, i.e. True values are 1 and False values are 0. I made sure I converted the type to category for the target attribute, but I was still getting the same result.

Then I modified the code to convert the target column to a factor using the asfactor() method on the H2OFrame; still, there wasn't any change in the result.

But when I changed the values in the target attribute to True and False for 1 and 0 respectively, I got the expected result, i.e. the output was the classification rather than the probability.
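For reference, here is a minimal sketch of the setup described above; the file name "data.csv" and the column name "target" are hypothetical placeholders:

```python
import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()

# Hypothetical file and column names, for illustration only
df = h2o.import_file("data.csv")
df["target"] = df["target"].asfactor()  # convert the numeric 0/1 target to a factor

train, test = df.split_frame(ratios=[0.8], seed=42)

model = H2ORandomForestEstimator(ntrees=50, seed=42)
model.train(x=[c for c in df.columns if c != "target"],
            y="target",
            training_frame=train)

# For a binomial (factor) target, the returned frame has a "predict" column
# with the hard class plus per-class probability columns "p0" and "p1"
preds = model.predict(test)
```

Note that H2O decides between regression and classification from the target column's type: with a numeric target, predict returns a single numeric column, while with a factor target it returns the hard class together with the per-class probability columns.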

  • What is the right way to get the classified prediction result?
  • If probabilities are the outcomes for numerical target values, then how do I handle it in case of a multiclass classification?

Recommended Answer

In principle & in theory, hard & soft classification (i.e. returning classes & probabilities respectively) are different approaches, each one with its own merits & downsides. Consider for example the following, from the paper Hard or Soft Classification? Large-margin Unified Machines:

Margin-based classifiers have been popular in both machine learning and statistics for classification problems. Among numerous classifiers, some are hard classifiers while some are soft ones. Soft classifiers explicitly estimate the class conditional probabilities and then perform classification based on estimated probabilities. In contrast, hard classifiers directly target on the classification decision boundary without producing the probability estimation. These two types of classifiers are based on different philosophies and each has its own merits.

That said, in practice, most of the classifiers used today, including Random Forest (the only exception I can think of is the SVM family) are in fact soft classifiers: what they actually produce underneath is a probability-like measure, which subsequently, combined with an implicit threshold (usually 0.5 by default in the binary case), gives a hard class membership like 0/1 or True/False.
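As a trivial illustration of that implicit step, here is a sketch (with made-up scores) of how a default 0.5 threshold turns probability-like outputs into hard labels:

```python
import numpy as np

proba = np.array([0.12, 0.47, 0.55, 0.91])  # made-up probability-like scores
hard = (proba >= 0.5).astype(int)           # the implicit default threshold
# hard is now array([0, 0, 1, 1]) -- the 0/1 labels the user actually sees
```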

What is the right way to get the classified prediction result?

For starters, it is always possible to go from probabilities to hard classes, but the opposite is not true.

Generally speaking, and given the fact that your classifier is in fact a soft one, getting just the end hard classifications (True/False) gives a "black box" flavor to the process, which in principle should be undesirable; handling directly the produced probabilities, and (important!) controlling explicitly the decision threshold should be the preferable way here. According to my experience, these are subtleties that are often lost to new practitioners; consider for example the following, from the Cross Validated thread Classification probability threshold:

the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.

Apart from "soft" arguments (pun unintended) like the above, there are cases where you need to handle directly the underlying probabilities and thresholds, i.e. cases where the default threshold of 0.5 in binary classification will lead you astray, most notably when your classes are imbalanced; see my answer in High AUC but bad predictions with imbalanced data (and the links therein) for a concrete example of such a case.
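To make this concrete, here is a sketch in scikit-learn (used here instead of H2O purely for brevity; the 0.3 threshold is an arbitrary illustration, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem (roughly 90% / 10%)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# Set the decision threshold explicitly instead of accepting the implicit 0.5;
# in practice you would tune this value against your actual objective
threshold = 0.3
hard_pred = (proba >= threshold).astype(int)
```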

To be honest, I am rather surprised by the behavior of H2O you report (I haven't used it personally), i.e. that the kind of output is affected by the representation of the input; this should not be the case, and if it is indeed, we may have an issue of bad design. Compare for example the Random Forest classifier in scikit-learn, which includes two different methods, predict and predict_proba, to get the hard classifications and the underlying probabilities respectively (and, checking the docs, it is apparent that the output of predict is based on the probability estimates, which have already been computed before).
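A quick sketch of that contrast (any built-in dataset would do; breast_cancer is used purely for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(random_state=0).fit(X, y)

proba = clf.predict_proba(X[:5])  # shape (5, 2): columns [p(class 0), p(class 1)]
hard = clf.predict(X[:5])         # hard 0/1 labels

# predict() simply takes the argmax over the already-computed probabilities
assert (hard == proba.argmax(axis=1)).all()
```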

If probabilities are the outcomes for numerical target values, then how do I handle it in case of a multiclass classification?

There is nothing new here in principle, apart from the fact that a simple threshold is no longer meaningful; again, from the Random Forest predict docs in scikit-learn:

the predicted class is the one with highest mean probability estimate

That is, for 3 classes (0, 1, 2), you get an estimate of [p0, p1, p2] (with elements summing up to one, as per the rules of probability), and the predicted class is the one with the highest probability, e.g. class #1 for the case of [0.12, 0.60, 0.28]. Here is a reproducible example with the 3-class iris dataset (it's for the GBM algorithm and in R, but the rationale is the same).
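Since the linked example uses GBM in R, here is a Python analogue of the same argmax rule on the 3-class iris dataset (a Random Forest stands in for GBM; the rule itself is model-agnostic):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(random_state=0).fit(X, y)

probs = clf.predict_proba(X[:1])[0]  # e.g. [p0, p1, p2], summing to 1
predicted = int(np.argmax(probs))    # the class with the highest probability
```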
