How to interpret base_value of GBT classifier when using SHAP?

Problem description

I recently discovered SHAP, an amazing library for ML interpretability. I decided to build a simple gradient boosting classifier using a toy dataset from sklearn and to draw a force_plot.

To understand the plot the library says:

The above explanation shows features each contributing to push the model output from the base value (the average model output over the training dataset we passed) to the model output. Features pushing the prediction higher are shown in red, those pushing the prediction lower are in blue (these force plots are introduced in our Nature BME paper).

So it looks to me as if the base_value should be the same as clf.predict(X_train).mean(), which equals 0.637. However, this is not the case when looking at the plot: the number is not even within [0,1]. I tried taking the log in different bases (10, e, 2), assuming it would be some kind of monotonic transformation, but still no luck. How can I get to this base_value?

!pip install shap

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap

X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0)
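# pass a 1-D label array; a single-column DataFrame triggers a DataConversionWarning
clf.fit(X_train, y_train.values.ravel())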

print(clf.predict(X_train).mean())

# load JS visualization code to notebook
shap.initjs()

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_train)

# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values[0,:], X_train.iloc[0,:])

Solution

To get base_value in raw space (when link="identity"), you need to unwind class labels to probabilities, and probabilities to raw scores. Note that the default loss is "deviance", so the raw score is the inverse sigmoid (logit) of the probability:

import numpy as np
# probabilities of the positive class
y = clf.predict_proba(X_train)[:,1]
# raw scores (log-odds), default link="identity"
y_raw = np.log(y/(1-y))
# expected raw score
print(np.mean(y_raw))
print(np.isclose(explainer.expected_value, np.mean(y_raw), 1e-12))
2.065861773054686
[ True]
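
As a cross-check on the manual inverse sigmoid above, the raw scores can also be read directly from the model. This is a minimal sketch, assuming scikit-learn's decision_function returns the log-odds for a binary GradientBoostingClassifier (which is how predict_proba is derived from it); the variable names raw_direct, raw_from_proba, same_scores and base_value_raw are illustrative and not part of the original answer. For brevity it trains on plain NumPy arrays rather than DataFrames.

```python
import numpy as np
from scipy.special import logit
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Raw scores (log-odds) taken straight from the model
raw_direct = clf.decision_function(X_train)

# Raw scores recovered by inverting the sigmoid on class-1 probabilities
raw_from_proba = logit(clf.predict_proba(X_train)[:, 1])

# The two routes should agree up to floating-point error
same_scores = np.allclose(raw_direct, raw_from_proba, rtol=1e-6, atol=1e-6)

# Mean raw score over the training set, i.e. the candidate base value
base_value_raw = raw_direct.mean()
print(same_scores, base_value_raw)
```

If the two routes agree, base_value_raw is the number to compare against explainer.expected_value.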

The relevant plot for 0th data point in raw space:

shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="identity")

Should you wish to switch to sigmoid probability space (link="logit"):

from scipy.special import expit, logit
# probabilities of the positive class
y = clf.predict_proba(X_train)[:,1]
# expected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability space
print(expit(y_raw))
0.8875405774316522

The relevant plot for 0th data point in probability space:

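Note that the probability base_value from shap's perspective (what they call the baseline probability when no data is available) is not what a reasonable person would define as the prediction with no independent variables (0.6373626373626373 in this case). shap's value is the sigmoid of the mean raw score, which generally differs from the mean of the predictions because the sigmoid is nonlinear.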
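
To see why the two baselines differ, here is a small self-contained sketch with made-up probabilities (not taken from the breast cancer model). It compares the plain average of the probabilities with the sigmoid of the average log-odds, which is the quantity the logit-link force plot uses as its base value. The helper functions and the numbers in probs are purely illustrative.

```python
import math

def logit(p):
    """Inverse sigmoid: probability -> log-odds."""
    return math.log(p / (1.0 - p))

def expit(z):
    """Sigmoid: log-odds -> probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical predicted probabilities, most of them confidently positive
probs = [0.999, 0.99, 0.98, 0.02, 0.01]

# Baseline 1: average in probability space
mean_prob = sum(probs) / len(probs)

# Baseline 2: average in raw (log-odds) space, then mapped back
mean_raw = sum(logit(p) for p in probs) / len(probs)
base_prob = expit(mean_raw)

print(mean_prob, base_prob)
```

The two numbers differ because averaging and applying the sigmoid do not commute; very confident predictions pull the mean log-odds further than they pull the mean probability.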


Full reproducible example:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap
print(shap.__version__)

X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train.values.ravel())

# load JS visualization code to notebook
shap.initjs()

explainer = shap.TreeExplainer(clf, model_output="raw")
shap_values = explainer.shap_values(X_train)

from scipy.special import expit, logit
# probabilities of the positive class
y = clf.predict_proba(X_train)[:,1]
# expected raw base value
y_raw = logit(y).mean()
# expit(y_raw) is the expected probability, i.e. base value in probability space
print("Expected raw score (before sigmoid):", y_raw)
print("Expected probability:", expit(y_raw))

# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="logit")

Output:

0.36.0
Expected raw score (before sigmoid): 2.065861773054686
Expected probability: 0.8875405774316522
