Cross-validation in LightGBM


Problem Description

After reading through LightGBM's documentation on cross-validation, I'm hoping this community can shed light on cross-validating results and on improving our predictions with LightGBM. How are we supposed to use the dictionary output from lightgbm.cv to improve our predictions?

Here's an example - we train our CV model using the code below:

import lightgbm as lgb

# d_train is an lgb.Dataset built beforehand, e.g.:
# d_train = lgb.Dataset(X_train, label=y_train)
cv_mod = lgb.cv(params,
                d_train,
                num_boost_round=500,
                nfold=10,
                early_stopping_rounds=25,
                stratified=True)

How can we use the parameters found from the best iteration of the above code to predict an output? In this case, cv_mod has no "predict" method like lightgbm.train does, and the dictionary output from lightgbm.cv throws an error when used in lightgbm.train.predict(..., pred_parameters = cv_mod).

Am I missing an important transformation step?

Recommended Answer

In general, the purpose of CV is NOT to do hyperparameter optimisation. The purpose is to evaluate the performance of the model-building procedure.

A basic train/test split is conceptually identical to a 1-fold CV (with a custom split size, in contrast to the 1/K train size in k-fold CV). The advantage of doing more splits (i.e. k>1 CV) is to get more information about the estimate of the generalisation error: you get the error together with its statistical uncertainty. There is an excellent discussion on CrossValidated (start with the links added to the question, which cover the same question formulated in different ways). It covers nested cross-validation and is absolutely not straightforward. But if you wrap your head around the concept in general, it will help you in various non-trivial situations. The idea to take away is: the purpose of CV is to evaluate the performance of the model-building procedure.
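For concreteness, here is a minimal sketch of how to read that error + uncertainty out of the dictionary returned by lightgbm.cv in the question above. The key names follow the pattern '<metric>-mean' / '<metric>-stdv' and depend on the metric set in params and on the LightGBM version, so 'binary_logloss' below is an assumption:

import numpy as np

# cv_mod is the dictionary returned by lgb.cv in the question's snippet.
# Each value is a list with one entry per boosting round, aggregated over folds.
mean_scores = cv_mod['binary_logloss-mean']   # assumed metric key
std_scores = cv_mod['binary_logloss-stdv']    # fold-to-fold spread

best_round = int(np.argmin(mean_scores))      # round with the best mean CV score
print('best iteration: %d, CV logloss: %.4f +/- %.4f'
      % (best_round + 1, mean_scores[best_round], std_scores[best_round]))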

Keeping that idea in mind, how does one approach hyperparameter estimation in general (not only in LightGBM)?

  • You want to train a model with a set of parameters on some data and evaluate each variant of the model on an independent (validation) set. Then you choose the best parameters as those of the variant that gives the best value of your chosen evaluation metric.
  • This can be done with a simple train/test split. But the evaluated performance, and thus the choice of the optimal model parameters, might be just a fluctuation of that particular split.
  • Thus, you can evaluate each of those models in a statistically more robust way by averaging the evaluation over several train/test splits, i.e. k-fold CV (see the sketch after this list).
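To make the list above concrete, here is a hedged sketch of parameter selection by averaging over k folds, using sklearn's cross_val_score; the toy dataset and the candidate num_leaves values are purely illustrative:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=1000, random_state=42)  # toy stand-in data

results = {}
for num_leaves in [15, 31, 63]:               # hypothetical candidate values
    model = LGBMClassifier(num_leaves=num_leaves, n_estimators=100)
    # The score is averaged over 5 train/test splits, which smooths out
    # the fluctuation of any single split.
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_log_loss')
    results[num_leaves] = scores.mean()

best_num_leaves = max(results, key=results.get)  # neg_log_loss: higher is better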

Then you can go a step further and say that you had an additional hold-out set that was separated before the hyperparameter optimisation started. This way you can evaluate the chosen best model on that set to measure the final generalisation error. However, you can go even further: instead of having a single test sample you can have an outer CV loop, which brings us to nested cross-validation.
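A hedged sketch of that nested setup with sklearn tools, with GridSearchCV as the inner tuning loop and cross_val_score as the outer loop (the grid and the toy data are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=1000, random_state=42)  # toy stand-in data

# Inner loop: hyperparameter selection, itself a k-fold CV.
inner = GridSearchCV(LGBMClassifier(),
                     param_grid={'num_leaves': [15, 31, 63]},  # hypothetical grid
                     scoring='neg_log_loss', cv=5)

# Outer loop: estimates the generalisation error of the whole
# model-building procedure, tuning included.
outer_scores = cross_val_score(inner, X, y, cv=5, scoring='neg_log_loss')
print('nested CV logloss: %.4f +/- %.4f' % (-outer_scores.mean(), outer_scores.std()))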

Technically, lightgbm.cv() only allows you to evaluate performance on a k-fold split with fixed model parameters. For hyperparameter tuning you will need to run it in a loop, providing different parameters and recording the averaged performance, so that you can choose the best parameter set once the loop is complete (see the sketch below). This interface is different from sklearn's, which provides you with complete functionality for doing hyperparameter optimisation in a CV loop. Personally, I would recommend using the sklearn API of lightgbm. It is just a wrapper around the native lightgbm.train() functionality, so it is not slower, but it allows you to use the full stack of the sklearn toolkit, which makes your life MUCH easier.
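Putting it together for the question's snippet, here is a hedged sketch of such a tuning loop around lightgbm.cv, followed by the "missing transformation step": lgb.cv only reports scores, so you retrain with lgb.train on the chosen parameters and iteration count before predicting. The grid, the 'binary_logloss-mean' key, and X_test are assumptions:

import numpy as np
import lightgbm as lgb

best_params, best_score, best_rounds = None, float('inf'), 0
for num_leaves in [15, 31, 63]:                  # hypothetical grid
    trial = dict(params, num_leaves=num_leaves)  # params/d_train from the question
    cv_results = lgb.cv(trial, d_train, num_boost_round=500,
                        nfold=10, early_stopping_rounds=25, stratified=True)
    mean_scores = cv_results['binary_logloss-mean']  # key depends on your metric
    if min(mean_scores) < best_score:
        best_score = min(mean_scores)
        best_rounds = int(np.argmin(mean_scores)) + 1  # best number of rounds
        best_params = trial

# lgb.cv returns scores, not a model: retrain on the full training data
# with the selected parameters, then predict.
final_model = lgb.train(best_params, d_train, num_boost_round=best_rounds)
preds = final_model.predict(X_test)              # X_test: your held-out features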
