How to extract decision rules (features splits) from xgboost model in python3?

Question
I need to extract the decision rules from my fitted xgboost model in Python. I am using version 0.6a2 of the xgboost library, and my Python version is 3.5.2.
My ultimate goal is to use those splits to bin variables (according to the splits).

I did not come across any property of the model in this version that exposes the splits.
plot_tree gives me something similar, but it is a visualization of the tree rather than the split values themselves.
I need something like https://stackoverflow.com/a/39772170/4559070, but for an xgboost model.
It is possible, but not easy. I would recommend using GradientBoostingClassifier from scikit-learn, which is similar to xgboost but has native access to the built trees.
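For reference, a minimal sketch of that native access (assuming scikit-learn is installed): each fitted tree exposes a low-level `tree_` structure whose `feature` and `threshold` arrays hold the splits directly.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
model = GradientBoostingClassifier(max_depth=2, n_estimators=2).fit(X, y)

# estimators_ is an (n_estimators, n_classes) array of regression trees
splits = []
for tree in model.estimators_.ravel():
    t = tree.tree_
    for node in range(t.node_count):
        if t.children_left[node] != -1:  # internal node, not a leaf
            splits.append((int(t.feature[node]), float(t.threshold[node])))
print(splits)
```

No text parsing is needed here, which is the main advantage over the xgboost approach below.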
With xgboost, however, it is possible to get a textual representation of the model and then parse it:
from sklearn.datasets import load_iris
from xgboost import XGBClassifier

# build a very simple model
X, y = load_iris(return_X_y=True)
model = XGBClassifier(max_depth=2, n_estimators=2)
model.fit(X, y)

# dump it to a text file
model.get_booster().dump_model('xgb_model.txt', with_stats=True)

# read the contents of the file
with open('xgb_model.txt', 'r') as f:
    txt_model = f.read()
print(txt_model)
It will print a textual description of 6 trees (2 estimators, each consisting of 3 trees, one per class), which starts like this:
booster[0]:
0:[f2<2.45] yes=1,no=2,missing=1,gain=72.2968,cover=66.6667
	1:leaf=0.143541,cover=22.2222
	2:leaf=-0.0733496,cover=44.4444
booster[1]:
0:[f2<2.45] yes=1,no=2,missing=1,gain=18.0742,cover=66.6667
	1:leaf=-0.0717703,cover=22.2222
	2:[f3<1.75] yes=3,no=4,missing=3,gain=41.9078,cover=44.4444
		3:leaf=0.124,cover=24
		4:leaf=-0.0668394,cover=20.4444
...
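The dump format is regular enough to recover more than just the splits. As a sketch (with a fragment of one booster hard-coded so the snippet runs standalone; node ids repeat across boosters, so a full parser would key on the `booster[i]:` headers too), here is how both split nodes and leaf values can be pulled out:

```python
import re

# fragment of the dump above, hard-coded for illustration
dump = """booster[1]:
0:[f2<2.45] yes=1,no=2,missing=1,gain=18.0742,cover=66.6667
\t1:leaf=-0.0717703,cover=22.2222
\t2:[f3<1.75] yes=3,no=4,missing=3,gain=41.9078,cover=44.4444
\t\t3:leaf=0.124,cover=24
\t\t4:leaf=-0.0668394,cover=20.4444"""

split_re = re.compile(r'(\d+):\[f(\d+)<([\d.]+)\]')
leaf_re = re.compile(r'(\d+):leaf=(-?[\d.]+)')

splits, leaves = {}, {}
for line in dump.splitlines():
    m = split_re.search(line)
    if m:  # internal node: node_id -> (feature_id, threshold)
        splits[int(m.group(1))] = (int(m.group(2)), float(m.group(3)))
        continue
    m = leaf_re.search(line)
    if m:  # leaf node: node_id -> leaf value
        leaves[int(m.group(1))] = float(m.group(2))

print(splits)  # {0: (2, 2.45), 2: (3, 1.75)}
print(leaves)  # {1: -0.0717703, 3: 0.124, 4: -0.0668394}
```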
Now you can, for example, extract all splits from this description:
import re

# extract all patterns like "[f2<2.45]"
splits = re.findall(r'\[f([0-9]+)<([0-9]+\.[0-9]+)\]', txt_model)
print(splits)
It will print the list of (feature_id, split_value) tuples, like
[('2', '2.45'),
('2', '2.45'),
('3', '1.75'),
('3', '1.65'),
('2', '4.95'),
('2', '2.45'),
('2', '2.45'),
('3', '1.75'),
('3', '1.65'),
('2', '4.95')]
You can further process this list as you wish.
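For the binning goal stated in the question, one stdlib-only sketch: collect the de-duplicated, sorted thresholds per feature and use `bisect` to map a value to its bin (`bin_value` is a hypothetical helper name, not part of xgboost).

```python
from bisect import bisect_right
from collections import defaultdict

# splits in the same (feature_id, split_value) string form as above
splits = [('2', '2.45'), ('2', '2.45'), ('3', '1.75'),
          ('3', '1.65'), ('2', '4.95')]

# sorted, de-duplicated thresholds per feature
thresholds = defaultdict(set)
for feat, val in splits:
    thresholds[int(feat)].add(float(val))
thresholds = {f: sorted(v) for f, v in thresholds.items()}

def bin_value(feature_id, x):
    """Return the bin index of x among the model's split points."""
    return bisect_right(thresholds[feature_id], x)

print(thresholds[2])      # [2.45, 4.95]
print(bin_value(2, 1.0))  # 0  (below 2.45)
print(bin_value(2, 3.0))  # 1  (between 2.45 and 4.95)
print(bin_value(2, 6.0))  # 2  (above 4.95)
```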