How to find the best degree of polynomials?


Problem description

I'm new to machine learning and am currently stuck on this. First I used linear regression to fit the training set, but got a very large RMSE. Then I tried polynomial regression to reduce the bias.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
poly_reg = LinearRegression()
poly_reg.fit(X_poly, y)

poly_predict = poly_reg.predict(X_poly)
poly_mse = mean_squared_error(y, poly_predict)
poly_rmse = np.sqrt(poly_mse)
poly_rmse

I then got a slightly better result than with linear regression, so I went on to set degree = 3/4/5, and the result kept getting better. But the model may be overfitting as the degree increases.

The best polynomial degree should be the one that gives the lowest RMSE on the cross-validation set. But I have no idea how to achieve that. Should I use GridSearchCV, or some other method?

I'd much appreciate it if you could help me with this.

Recommended answer

Next time you should provide the data for X/y, or some dummy data; that would be faster and would get you a specific solution. For now I've created a dummy equation of the form y = X**4 + X**3 + X + 1.

There are many ways you can improve on this, but a quick way to find the best degree is to simply fit your data for each degree and pick the degree with the best performance on held-out data (e.g., lowest RMSE).

You can also experiment with how you hold out your train/test/validation data.

import numpy as np
import matplotlib.pyplot as plt 

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)
y = X**4 + X**3 + X + 1

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

rmses = []
degrees = np.arange(1, 10)
min_rmse, min_deg = 1e10, 0

for deg in degrees:

    # Train features
    poly_features = PolynomialFeatures(degree=deg, include_bias=False)
    x_poly_train = poly_features.fit_transform(x_train)

    # Linear regression
    poly_reg = LinearRegression()
    poly_reg.fit(x_poly_train, y_train)

    # Compare with test data (reuse the transformer fitted on the training set)
    x_poly_test = poly_features.transform(x_test)
    poly_predict = poly_reg.predict(x_poly_test)
    poly_mse = mean_squared_error(y_test, poly_predict)
    poly_rmse = np.sqrt(poly_mse)
    rmses.append(poly_rmse)

    # Keep track of the degree with the lowest test RMSE so far
    if min_rmse > poly_rmse:
        min_rmse = poly_rmse
        min_deg = deg

# Plot and present results
print('Best degree {} with RMSE {}'.format(min_deg, min_rmse))

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(degrees, rmses)
ax.set_yscale('log')
ax.set_xlabel('Degree')
ax.set_ylabel('RMSE')
plt.show()

This prints something like the following (the exact RMSE varies with the random split):

Best degree 4 with RMSE 1.27689038706e-08
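
As one variation on how the data is held out, the single train/test split could be swapped for k-fold cross-validation, so that each degree is scored on several different held-out folds. The following is only a sketch under assumptions that are not part of the original answer: the noisy cubic data, the 5 shuffled folds, the degree range 1 to 7, and the names `cv_rmse` and `best_deg` are all chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data (an assumption): a noisy cubic, y = x^3 - 2x + noise
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 60).reshape(-1, 1)
y = X.ravel() ** 3 - 2 * X.ravel() + rng.normal(scale=1.0, size=60)

# Shuffle the folds because X is sorted; contiguous folds would test extrapolation
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_rmse = {}
for deg in range(1, 8):
    # The pipeline refits the polynomial features inside every fold
    model = make_pipeline(
        PolynomialFeatures(degree=deg, include_bias=False),
        LinearRegression(),
    )
    # scikit-learn maximizes scores, so the MSE comes back negated
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    # Square root of the mean per-fold MSE
    cv_rmse[deg] = float(np.sqrt(-scores.mean()))

best_deg = min(cv_rmse, key=cv_rmse.get)
print("Best degree {} with CV RMSE {}".format(best_deg, cv_rmse[best_deg]))
```

Averaging over folds makes the chosen degree less sensitive to one lucky or unlucky split, at the cost of fitting each candidate several times.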

Alternatively, you could build a new class that carries out the polynomial fitting, and pass that to GridSearchCV with a set of parameters.
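
A hedged sketch of the GridSearchCV route follows. Instead of writing a custom class, it chains `PolynomialFeatures` and `LinearRegression` in a scikit-learn `Pipeline`, which is a substitution for the class the answer describes, not the answer's own code. The step names `poly` and `reg`, the noisy cubic data, the degree grid, and the 5 shuffled folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data (an assumption): a noisy cubic, y = x^3 - 2x + noise
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 60).reshape(-1, 1)
y = X.ravel() ** 3 - 2 * X.ravel() + rng.normal(scale=1.0, size=60)

# Step names are arbitrary; they prefix the parameter names in the grid
pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("reg", LinearRegression()),
])

# "<step>__<parameter>" addresses a parameter of one pipeline step
param_grid = {"poly__degree": list(range(1, 8))}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="neg_mean_squared_error",  # negated so that higher is better
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

best_degree = search.best_params_["poly__degree"]
best_rmse = float(np.sqrt(-search.best_score_))
print("Best degree {} with CV RMSE {}".format(best_degree, best_rmse))
```

After fitting, `search.best_estimator_` is the pipeline refit on all of the data with the selected degree, so it can be used directly for prediction.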
