Distinguishing overfitting vs good prediction

Problem Description

These are questions on how to calculate & reduce overfitting in machine learning. I think many new to machine learning will have the same questions, so I tried to be clear with my examples and questions in hope that answers here can help others.

I have a very small sample of texts and I'm trying to predict values associated with them. I've used sklearn to calculate tf-idf and insert those values into a regression model for prediction. This gives me 26 samples with 6323 features - not a lot, I know:

>>> from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
>>> # ngram_range=(1, 1) is the current spelling of the old min_n=1, max_n=1 arguments
>>> count_vectorizer = CountVectorizer(ngram_range=(1, 1))
>>> term_freq = count_vectorizer.fit_transform(texts)
>>> transformer = TfidfTransformer()
>>> X = transformer.fit_transform(term_freq)
>>> print(X.shape)

(26, 6323)

Inserting those 26 samples of 6323 features (X) and the associated scores (y) into a LinearRegression model gives good predictions. These are obtained using leave-one-out cross validation, from cross_validation.LeaveOneOut(X.shape[0], indices=True):

using ngrams (n=1):
     human  machine  points-off  %error
      8.67    8.27    0.40       1.98
      8.00    7.33    0.67       3.34
      ...     ...     ...        ...
      5.00    6.61    1.61       8.06
      9.00    7.50    1.50       7.50
mean: 7.59    7.64    1.29       6.47
std : 1.94    0.56    1.38       6.91
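
A minimal sketch of how such a leave-one-out table can be produced with the current scikit-learn API (the cross_validation.LeaveOneOut(X.shape[0], indices=True) call above is from an older release; LeaveOneOut now lives in sklearn.model_selection and takes no arguments). X and y are assumed to be the tf-idf matrix and human scores from above:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# cross_val_predict refits the model 26 times, each time predicting
# only the single held-out text.
machine = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())

human = np.asarray(y)
points_off = np.abs(machine - human)
print("mean:", human.mean(), machine.mean(), points_off.mean())
print("std :", human.std(), machine.std(), points_off.std())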

Pretty good! Using ngrams (n=300) instead of unigrams (n=1), similar results occur, which is obviously not right. No 300-word n-grams occur in any of the texts, so the prediction should fail, but it doesn't:

using ngrams (n=300):
      human  machine  points-off  %error
       8.67    7.55    1.12       5.60
       8.00    7.57    0.43       2.13
       ...     ...     ...        ...
mean:  7.59    7.59    1.52       7.59
std :  1.94    0.08    1.32       6.61

Question 1: This might mean that the prediction model is overfitting the data. I only know this because I chose an extreme value for the ngrams (n=300) which I KNOW can't produce good results. But if I didn't have this knowledge, how would you normally tell that the model is over-fitting? In other words, if a reasonable measure (n=1) were used, how would you know that the good prediction was a result of being overfit vs. the model just working well?

Question 2: What is the best way of preventing over-fitting (in this situation), so I can be sure whether the prediction results are good or not?

Question 3: If LeaveOneOut cross validation is used, how can the model possibly over-fit with good results? Over-fitting means the prediction accuracy will suffer - so why doesn't it suffer on the prediction for the text being left out? The only reason I can think of: in a tf-idf sparse matrix of mainly 0s, there is strong overlap between texts because so many terms are 0s - the regression then thinks the texts correlate highly.

Please answer any of the questions even if you don't know them all. Thanks!

Answer

how would you normally tell that the model is over-fitting?

One useful rule of thumb is that you may be overfitting when your model's performance on its own training set is much better than on its held-out validation set or in a cross-validation setting. That's not all there is to it, though.
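
As a rough illustration of that rule of thumb (a sketch only; X and y are assumed to be the tf-idf matrix and scores from the question):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

model = LinearRegression()

# Error on the same data the model was fit on ...
train_mae = np.mean(np.abs(model.fit(X, y).predict(X) - np.asarray(y)))

# ... versus error estimated on held-out folds.  R^2 is undefined for
# single-sample folds, so mean absolute error is used with leave-one-out.
cv_mae = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                          scoring='neg_mean_absolute_error').mean()

# A training error far below the cross-validated error is the warning sign.
print(train_mae, cv_mae)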

The blog entry I linked to describes a procedure for testing for overfit: plot training set and validation set error as a function of training set size. If they show a stable gap at the right end of the plot, you're probably overfitting.
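
A sketch of that procedure with scikit-learn's learning_curve helper (matplotlib is assumed to be available; X and y as in the question):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Training and validation error as a function of training-set size,
# averaged over the cross-validation folds.
sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5, scoring='neg_mean_absolute_error')

plt.plot(sizes, -train_scores.mean(axis=1), 'o-', label='training error')
plt.plot(sizes, -val_scores.mean(axis=1), 'o-', label='validation error')
plt.xlabel('training set size')
plt.ylabel('mean absolute error')
plt.legend()
plt.show()
# A persistent gap between the curves at the right-hand side suggests overfitting.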

What is the best way of preventing over-fitting (in this situation) to be sure that the prediction results are good or not?

Use a held-out test set. Only do evaluation on this set when you're completely done with model selection (hyperparameter tuning); don't train on it, don't use it in (cross-)validation. The score you get on the test set is the model's final evaluation. This should show whether you've accidentally overfit the validation set(s).
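
Mechanically, that might look like the sketch below (with only 26 samples, holding out a test set is painful, but the principle is the same at any size; X and y as in the question):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict, train_test_split

# Set some texts aside before doing anything else.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Do all model selection and tuning on the development portion only,
# e.g. with leave-one-out cross-validation as in the question.
dev_pred = cross_val_predict(LinearRegression(), X_dev, y_dev, cv=LeaveOneOut())

# Only once the model is frozen: fit on the development data and
# evaluate a single time on the untouched test set.
final_model = LinearRegression().fit(X_dev, y_dev)
test_pred = final_model.predict(X_test)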

[Machine learning conferences are sometimes set up like a competition, where the test set is not given to the researchers until after they've delivered their final model to the organisers. In the meanwhile, they can use the training set as they please, e.g. by testing models using cross-validation. Kaggle does something similar.]

If LeaveOneOut cross validation is used, how can the model possibly over-fit with good results?

Because you can tune the model as much as you want in this cross-validation setting, until it performs nearly perfectly in CV.

As an extreme example, suppose that you've implemented an estimator that is essentially a random number generator. You can keep trying random seeds until you hit a "model" that produces very low error in cross-validation, but that doesn't mean you've hit the right model. It means you've overfit to the cross-validation.
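
A toy version of that extreme example (nothing from the question's setup is reused here): the "model" below ignores the data entirely, yet searching over enough seeds still turns up one whose error looks deceptively good:

import numpy as np

rng = np.random.RandomState(0)
y = rng.uniform(5, 9, 26)            # toy "human" scores; there is nothing to learn

# Because the predictions ignore the data completely, leave-one-out evaluation
# reduces to scoring all 26 random predictions at once.
best_seed, best_err = None, np.inf
for seed in range(500):
    preds = np.random.RandomState(seed).uniform(5, 9, 26)
    err = np.mean(np.abs(preds - y))
    if err < best_err:
        best_seed, best_err = seed, err

# The cherry-picked seed's error is noticeably lower than a typical random guess ...
print(best_seed, best_err)

# ... but the same "model" with a fresh seed is back to the expected, much larger,
# error: the low score above came from overfitting to the evaluation itself.
print(np.mean(np.abs(np.random.RandomState(9999).uniform(5, 9, 26) - y)))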

See also this interesting story.
