Distinguishing overfitting vs good prediction

Problem Description

These are questions on how to calculate & reduce overfitting in machine learning. I think many new to machine learning will have the same questions, so I tried to be clear with my examples and questions in hope that answers here can help others.

I have a very small sample of texts and I'm trying to predict values associated with them. I've used sklearn to calculate tf-idf and insert those values into a regression model for prediction. This gives me 26 samples with 6323 features - not a lot, I know:

>>> from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
>>> # ngram_range=(1, 1) is the current spelling of the old min_n=1, max_n=1 arguments
>>> count_vectorizer = CountVectorizer(ngram_range=(1, 1))
>>> term_freq = count_vectorizer.fit_transform(texts)
>>> transformer = TfidfTransformer()
>>> X = transformer.fit_transform(term_freq)
>>> print(X.shape)

(26, 6323)

Inserting those 26 samples of 6323 features (X) and the associated scores (y) into a LinearRegression model gives good predictions. These are obtained using leave-one-out cross validation, from cross_validation.LeaveOneOut(X.shape[0], indices=True):

using ngrams (n=1):
     human  machine  points-off  %error
      8.67    8.27    0.40       1.98
      8.00    7.33    0.67       3.34
      ...     ...     ...        ...
      5.00    6.61    1.61       8.06
      9.00    7.50    1.50       7.50
mean: 7.59    7.64    1.29       6.47
std : 1.94    0.56    1.38       6.91
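
A minimal sketch of how such a leave-one-out table can be produced with the current scikit-learn API (the cross_validation.LeaveOneOut(X.shape[0], indices=True) call above is from an older release; LeaveOneOut now lives in sklearn.model_selection and takes no arguments). X and y are assumed to be the tf-idf matrix and human scores from above:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# cross_val_predict refits the model 26 times, each time predicting
# only the single held-out text.
machine = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())

human = np.asarray(y)
points_off = np.abs(machine - human)
print("mean:", human.mean(), machine.mean(), points_off.mean())
print("std :", human.std(), machine.std(), points_off.std())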

Pretty good! Using ngrams (n=300) instead of unigrams (n=1), similar results occur, which is obviously not right. No 300-word n-grams occur in any of the texts, so the prediction should fail, but it doesn't:

using ngrams (n=300):
      human  machine  points-off  %error
       8.67    7.55    1.12       5.60
       8.00    7.57    0.43       2.13
       ...     ...     ...        ...
mean:  7.59    7.59    1.52       7.59
std :  1.94    0.08    1.32       6.61

Question 1: This might mean that the prediction model is overfitting the data. I only know this because I chose an extreme value for the ngrams (n=300) which I KNOW can't produce good results. But if I didn't have this knowledge, how would you normally tell that the model is over-fitting? In other words, if a reasonable measure (n=1) were used, how would you know that the good prediction was a result of being overfit vs. the model just working well?

Question 2: What is the best way of preventing over-fitting (in this situation), so I can be sure whether the prediction results are good or not?

Question 3: If LeaveOneOut cross validation is used, how can the model possibly over-fit with good results? Over-fitting means the prediction accuracy will suffer - so why doesn't it suffer on the prediction for the text being left out? The only reason I can think of: in a tf-idf sparse matrix of mainly 0s, there is strong overlap between texts because so many terms are 0s - the regression then thinks the texts correlate highly.

Please answer any of the questions even if you don't know them all. Thanks!

Answer

how would you normally tell that the model is over-fitting?

One useful rule of thumb is that you may be overfitting when your model's performance on its own training set is much better than on its held-out validation set or in a cross-validation setting. That's not all there is to it, though.
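
As a rough illustration of that rule of thumb (a sketch only; X and y are assumed to be the tf-idf matrix and scores from the question):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

model = LinearRegression()

# Error on the same data the model was fit on ...
train_mae = np.mean(np.abs(model.fit(X, y).predict(X) - np.asarray(y)))

# ... versus error estimated on held-out folds.  R^2 is undefined for
# single-sample folds, so mean absolute error is used with leave-one-out.
cv_mae = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                          scoring='neg_mean_absolute_error').mean()

# A training error far below the cross-validated error is the warning sign.
print(train_mae, cv_mae)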

The blog entry I linked to describes a procedure for testing for overfit: plot training set and validation set error as a function of training set size. If they show a stable gap at the right end of the plot, you're probably overfitting.
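
A sketch of that procedure with scikit-learn's learning_curve helper (matplotlib is assumed to be available; X and y as in the question):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Training and validation error as a function of training-set size,
# averaged over the cross-validation folds.
sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5, scoring='neg_mean_absolute_error')

plt.plot(sizes, -train_scores.mean(axis=1), 'o-', label='training error')
plt.plot(sizes, -val_scores.mean(axis=1), 'o-', label='validation error')
plt.xlabel('training set size')
plt.ylabel('mean absolute error')
plt.legend()
plt.show()
# A persistent gap between the curves at the right-hand side suggests overfitting.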

What is the best way of preventing over-fitting (in this situation) to be sure that the prediction results are good or not?

Use a held-out test set. Only do evaluation on this set when you're completely done with model selection (hyperparameter tuning); don't train on it, don't use it in (cross-)validation. The score you get on the test set is the model's final evaluation. This should show whether you've accidentally overfit the validation set(s).
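
Mechanically, that might look like the sketch below (with only 26 samples, holding out a test set is painful, but the principle is the same at any size; X and y as in the question):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict, train_test_split

# Set some texts aside before doing anything else.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Do all model selection and tuning on the development portion only,
# e.g. with leave-one-out cross-validation as in the question.
dev_pred = cross_val_predict(LinearRegression(), X_dev, y_dev, cv=LeaveOneOut())

# Only once the model is frozen: fit on the development data and
# evaluate a single time on the untouched test set.
final_model = LinearRegression().fit(X_dev, y_dev)
test_pred = final_model.predict(X_test)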

[Machine learning conferences are sometimes set up like a competition, where the test set is not given to the researchers until after they've delivered their final model to the organisers. In the meanwhile, they can use the training set as they please, e.g. by testing models using cross-validation. Kaggle does something similar.]

If LeaveOneOut cross validation is used, how can the model possibly over-fit with good results?

Because you can tune the model as much as you want in this cross-validation setting, until it performs nearly perfectly in CV.

As an extreme example, suppose that you've implemented an estimator that is essentially a random number generator. You can keep trying random seeds until you hit a "model" that produces very low error in cross-validation, but that doesn't mean you've hit the right model. It means you've overfit to the cross-validation.
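
A toy version of that extreme example (nothing from the question's setup is reused here): the "model" below ignores the data entirely, yet searching over enough seeds still turns up one whose error looks deceptively good:

import numpy as np

rng = np.random.RandomState(0)
y = rng.uniform(5, 9, 26)            # toy "human" scores; there is nothing to learn

# Because the predictions ignore the data completely, leave-one-out evaluation
# reduces to scoring all 26 random predictions at once.
best_seed, best_err = None, np.inf
for seed in range(500):
    preds = np.random.RandomState(seed).uniform(5, 9, 26)
    err = np.mean(np.abs(preds - y))
    if err < best_err:
        best_seed, best_err = seed, err

# The cherry-picked seed's error is noticeably lower than a typical random guess ...
print(best_seed, best_err)

# ... but the same "model" with a fresh seed is back to the expected, much larger,
# error: the low score above came from overfitting to the evaluation itself.
print(np.mean(np.abs(np.random.RandomState(9999).uniform(5, 9, 26) - y)))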

See also this interesting story.
