Equivalent of predict_proba for DecisionTreeRegressor

Question

scikit-learn's DecisionTreeClassifier supports predicting probabilities of each class via the predict_proba() function. This is absent from DecisionTreeRegressor:


AttributeError: 'DecisionTreeRegressor' object has no attribute 'predict_proba'
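The gap can be confirmed directly against the two estimator classes, without fitting anything:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# The classifier defines predict_proba as a method...
assert hasattr(DecisionTreeClassifier, 'predict_proba')
# ...while the regressor defines no such method, hence the AttributeError.
assert not hasattr(DecisionTreeRegressor, 'predict_proba')
```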

My understanding is that the underlying mechanics are pretty similar between decision tree classifiers and regressors, with the main difference being that a regressor's prediction is calculated as the mean of the target values in the corresponding leaf. So I'd expect it to be possible to extract the probabilities of each value.
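That intuition can be checked empirically: with the default squared-error criterion, a fitted regressor's prediction for a sample equals the mean of the training targets that landed in the same leaf. A minimal sketch on synthetic data (make_regression is used here purely for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)
m = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X, y)

# m.apply maps each sample to the id of the leaf it falls into.
sample_leaf = m.apply(X[:1])[0]
train_leaves = m.apply(X)

# The prediction equals the mean target of the training rows in that leaf.
leaf_mean = y[train_leaves == sample_leaf].mean()
assert np.isclose(m.predict(X[:1])[0], leaf_mean)
```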

Is there another way to simulate this, e.g. by processing the tree structure? The code for DecisionTreeClassifier's predict_proba wasn't directly transferable.

Answer

This function adapts code from hellpanderr's answer to provide probabilities of each outcome:

from sklearn.tree import DecisionTreeRegressor
import pandas as pd

def decision_tree_regressor_predict_proba(X_train, y_train, X_test, **kwargs):
    """Trains DecisionTreeRegressor model and predicts probabilities of each y.

    Args:
        X_train: Training features.
        y_train: Training labels.
        X_test: New data to predict on.
        **kwargs: Other arguments passed to DecisionTreeRegressor.

    Returns:
        DataFrame with columns for record_id (row of X_test), y 
        (predicted value), and prob (of that y value).
        The sum of prob equals 1 for each record_id.
    """
    # Train model.
    m = DecisionTreeRegressor(**kwargs).fit(X_train, y_train)
    # Get y values corresponding to each node.
    node_ys = pd.DataFrame({'node_id': m.apply(X_train), 'y': y_train})
    # Calculate probability as 1 / number of y values per node.
    node_ys['prob'] = 1 / node_ys.groupby('node_id')['y'].transform('count')
    # Aggregate per node-y, in case of multiple training records with the same y.
    node_ys_dedup = node_ys.groupby(['node_id', 'y']).prob.sum().to_frame()\
        .reset_index()
    # Extract predicted leaf node for each new observation.
    leaf = pd.DataFrame(m.decision_path(X_test).toarray()).apply(
        lambda x: x.to_numpy().nonzero()[0].max(), axis=1).to_frame(
            name='node_id')
    leaf['record_id'] = leaf.index
    # Merge with y values and drop node_id.
    return leaf.merge(node_ys_dedup, on='node_id').drop(
        'node_id', axis=1).sort_values(['record_id', 'y'])

Example (see this notebook: https://colab.research.google.com/drive/1O475-dUdJNtwg8osS8FfpVH_QeMRwf9g?usp=sharing):

from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Works better with min_samples_leaf > 1.
res = decision_tree_regressor_predict_proba(X_train, y_train, X_test,
                                            random_state=0, min_samples_leaf=5)
res[res.record_id == 2]
#      record_id       y        prob
#   25         2    20.6    0.166667
#   26         2    22.3    0.166667
#   27         2    22.7    0.166667
#   28         2    23.8    0.333333
#   29         2    25.0    0.166667
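As a sanity check on the same idea, the per-leaf mean of the training targets should reproduce the regressor's own predict() output, since each predicted probability is just a leaf-frequency weight. A sketch (load_diabetes stands in for load_boston, which was removed in scikit-learn 1.2):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=0)
m = DecisionTreeRegressor(random_state=0, min_samples_leaf=5).fit(
    X_train, y_train)

# Mean training target per leaf, keyed by leaf id as m.apply returns it.
leaf_means = pd.Series(y_train).groupby(m.apply(X_train)).mean()

# Looking up each test sample's leaf mean reproduces predict().
expected = pd.Series(m.apply(X_test)).map(leaf_means)
assert np.allclose(expected.to_numpy(), m.predict(X_test))
```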
