Cross-validation in LightGBM

Problem Description

How are we supposed to use the dictionary output from lightgbm.cv to improve our predictions?

Here's an example - we train our cv model using the code below:

import lightgbm as lgb

# 10-fold stratified CV for up to 500 boosting rounds,
# stopping early if the metric has not improved for 25 rounds
cv_mod = lgb.cv(params,
                d_train,
                num_boost_round=500,
                nfold=10,
                early_stopping_rounds=25,
                stratified=True)

How can we use the parameters found from the best iteration of the above code to predict an output? In this case, cv_mod has no "predict" method like lightgbm.train does, and the dictionary output from lightgbm.cv throws an error when used in lightgbm.train.predict(..., pred_parameters = cv_mod).

Am I missing an important transformation step?

Recommended Answer

In general, the purpose of CV is NOT to do hyperparameter optimisation. The purpose is to evaluate the performance of the model-building procedure.

A basic train/test split is conceptually identical to a 1-fold CV (with a custom split size, in contrast to the fixed 1/K test-fold size of k-fold CV). The advantage of doing more splits (i.e. k>1 CV) is that you get more information about the estimate of the generalisation error: both the error itself and its statistical uncertainty. There is an excellent discussion of this on CrossValidated (start with the links added to the question, which cover the same question formulated in different ways). It covers nested cross-validation and is absolutely not straightforward, but if you wrap your head around the concept in general, it will help you in various non-trivial situations. The idea you have to take away is: the purpose of CV is to evaluate the performance of the model-building procedure.

Keeping that idea in mind, how does one approach hyperparameter estimation in general (not only in LightGBM)?

  • You want to train a model with a set of parameters on some data and evaluate each variant of the model on an independent (validation) set. Then you intend to choose the best parameters by picking the variant that gives the best value of the evaluation metric of your choice.
  • This can be done with a simple train/test split. But the evaluated performance, and thus the choice of the optimal model parameters, might be just a fluctuation of the particular split.
  • Thus, you can evaluate each of those models more robustly by averaging the evaluation over several train/test splits, i.e. k-fold CV (see the sketch after this list).
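
For concreteness, here is a minimal sketch of such a search loop built around lightgbm.cv, matching the older LightGBM API used in the question. Only d_train comes from the question; the parameter grid, the 'auc' metric, and X_test are illustrative assumptions:

import lightgbm as lgb

# Hypothetical grid of candidate parameter sets.
candidate_params = [{'objective': 'binary', 'metric': 'auc',
                     'learning_rate': lr, 'num_leaves': nl}
                    for lr in (0.05, 0.1) for nl in (31, 63)]

best_params, best_score, best_rounds = None, float('-inf'), 0
for params in candidate_params:
    # lgb.cv returns a dict of per-iteration metric lists; with early
    # stopping the lists end at the best iteration. (In LightGBM >= 4.0
    # the key is 'valid auc-mean' and early stopping moves to callbacks.)
    cv_results = lgb.cv(params, d_train, num_boost_round=500,
                        nfold=10, early_stopping_rounds=25, stratified=True)
    score = cv_results['auc-mean'][-1]   # averaged CV score for this set
    if score > best_score:
        best_params, best_score = params, score
        best_rounds = len(cv_results['auc-mean'])

# lgb.cv builds no final model: retrain on the full data with the winning
# parameters and round count to get a booster that has a predict() method.
final_model = lgb.train(best_params, d_train, num_boost_round=best_rounds)
predictions = final_model.predict(X_test)  # X_test: assumed hold-out features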

Then you can go a step further and say that you have an additional hold-out set that was set aside before the hyperparameter optimisation started. This way you can evaluate the chosen best model on that set to measure the final generalisation error. You can go even one step further: instead of having a single test sample, you can have an outer CV loop, which brings us to nested cross-validation (sketched below).
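
You don't have to build that outer loop by hand. A minimal nested-CV sketch using scikit-learn, with toy data and a hypothetical parameter grid standing in for your real dataset and candidates:

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Toy data purely for illustration; substitute your own features/labels.
rng = np.random.default_rng(0)
X, y = rng.random((200, 5)), rng.integers(0, 2, 200)

# Inner loop: 5-fold CV picks the hyperparameters.
inner = GridSearchCV(LGBMClassifier(),
                     param_grid={'num_leaves': [15, 31],
                                 'learning_rate': [0.05, 0.1]},
                     cv=5, scoring='roc_auc')

# Outer loop: 5-fold CV estimates the generalisation error of the whole
# tune-then-fit procedure rather than of one fixed parameter set.
outer_scores = cross_val_score(inner, X, y, cv=5, scoring='roc_auc')
print(outer_scores.mean(), outer_scores.std())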

Technically, lightgbm.cv() only allows you to evaluate performance on a k-fold split with fixed model parameters. For hyperparameter tuning you need to run it in a loop, providing different parameters and recording the averaged performance, and choose the best parameter set once the loop is complete (as in the sketch above). This interface is different from sklearn's, which gives you complete functionality for doing hyperparameter optimisation in a CV loop. Personally, I would recommend using the sklearn API of lightgbm. It is just a wrapper around the native lightgbm.train() functionality, so it is not slower, but it lets you use the full sklearn toolkit, which makes your life MUCH easier.
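
A minimal sketch of that recommendation, assuming X_train, y_train, and X_test already exist; the parameter grid and the 'roc_auc' scoring choice are placeholders:

from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

# GridSearchCV runs the parameter loop and the k-fold CV for you,
# then refits on the full training data with the best parameter set.
search = GridSearchCV(LGBMClassifier(n_estimators=500),
                      param_grid={'num_leaves': [31, 63],
                                  'learning_rate': [0.05, 0.1]},
                      cv=10, scoring='roc_auc')
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)
predictions = search.predict(X_test)  # the fitted search exposes predict()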
