Why is training a random forest regressor with MAE criterion so slow compared to MSE?
Question
When training on even small datasets (<50K rows, <50 columns), using the mean absolute error criterion for sklearn's RandomForestRegressor is nearly 10x slower than using mean squared error. To illustrate, even on a small data set:
import time
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston
X, y = load_boston(return_X_y=True)
def fit_rf_criteria(criterion, X=X, y=y):
    reg = RandomForestRegressor(n_estimators=100,
                                criterion=criterion,
                                n_jobs=-1,
                                random_state=1)
    start = time.time()
    reg.fit(X, y)
    end = time.time()
    print(end - start)
fit_rf_criteria('mse') # 0.13266682624816895
fit_rf_criteria('mae') # 1.26043701171875
Why does using the 'mae' criterion take so long to train a RandomForestRegressor? I want to optimize MAE for larger applications, but I find the speed of the RandomForestRegressor tuned to this criterion prohibitively slow.
Answer
Thank you @hellpanderr for sharing a reference to the project issue. To summarize – when the random forest regressor optimizes for MSE, it optimizes for the L2-norm and a mean-based impurity metric. But when the regressor uses the MAE criterion, it optimizes for the L1-norm, which amounts to calculating the median. Unfortunately, sklearn's implementation of the MAE criterion for the regressor currently appears to take O(N^2) time.
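The point that the L1-optimal prediction is the median (while the L2-optimal one is the mean) can be checked numerically. The sketch below is a hypothetical illustration of that fact, not sklearn's internal split code: it scans candidate constant predictions over a skewed sample and finds the one minimizing each loss.

```python
import numpy as np

# Illustration (not sklearn internals): the constant prediction minimizing
# MSE is the mean; the one minimizing MAE is the median.
rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=1001)  # deliberately skewed sample

# Scan a fine grid of candidate constant predictions.
candidates = np.linspace(y.min(), y.max(), 5001)
mse = np.array([np.mean((y - c) ** 2) for c in candidates])
mae = np.array([np.mean(np.abs(y - c)) for c in candidates])

best_mse = candidates[np.argmin(mse)]  # should sit near np.mean(y)
best_mae = candidates[np.argmin(mae)]  # should sit near np.median(y)

print(best_mse, np.mean(y))
print(best_mae, np.median(y))
```

Because each MAE split needs a (weighted) median rather than a running mean, the per-node cost is much higher, which is consistent with the quadratic behavior reported in the issue.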