ML Model not predicting properly


Problem Description

I am trying to create an ML model (regression) using various techniques like SMR, Logistic Regression, and others. With all of these techniques, I'm not able to get more than 35% accuracy. Here's what I'm doing:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# X_data_distance and X_data_orders are loaded elsewhere.
X_data = [X_data_distance]
X_data = np.vstack(X_data).astype(np.float64)
X_data = X_data.T  # shape (10000, 1)
y_data = X_data_orders  # shape (10000,)
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.33, random_state=42)
svr_rbf = SVC(kernel='rbf', C=1.0)
svr_rbf.fit(X_train, y_train)
plt.plot(X_data_distance, svr_rbf.predict(X_data), color='red', label='RBF model')

For the plot, I'm getting the following:

[plot image not reproduced here]

I have tried various parameter tunings, changing the parameters C and gamma, and I even tried different kernels, but nothing changes the accuracy. I even tried SVR and Logistic Regression instead of SVC, but nothing helps. I tried different scalings for the training input data, like StandardScaler() and scale().

I am using this as a reference.

What should I do?

Recommended Answer

As a rule of thumb, we usually follow this convention:

  1. For a small number of features, go with Logistic Regression.
  2. For a lot of features but not a lot of data, go with SVM.
  3. For a lot of features and a lot of data, go with a Neural Network.

Because your dataset has 10K cases, it'd be better to use Logistic Regression, because SVM would take forever to finish!

Nevertheless, because your dataset contains a lot of classes, there is a chance of class imbalance in your implementation. Thus I worked around this problem by using StratifiedShuffleSplit instead of train_test_split, which doesn't guarantee balanced classes in the splits.

Moreover, I used GridSearchCV with StratifiedKFold to perform cross-validation in order to tune the parameters and try all the different solvers!

So the complete implementation is as follows:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, StratifiedShuffleSplit
import numpy as np


def getDataset(path, x_attr, y_attr):
    """
    Extract dataset from CSV file
    :param path: location of csv file
    :param x_attr: list of Features Names
    :param y_attr: Y header name in CSV file
    :return: tuple, (X, Y)
    """
    df = pd.read_csv(path)
    X = np.array(df[x_attr]).reshape(len(df), len(x_attr))
    Y = np.array(df[y_attr])
    return X, Y


def stratifiedSplit(X, Y):
    # A stratified split keeps the class proportions (roughly) equal
    # in the train and test sets, unlike a plain train_test_split.
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_index, test_index = next(sss.split(X, Y))
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    return X_train, X_test, Y_train, Y_test


def run(X_data, Y_data):
    X_train, X_test, Y_train, Y_test = stratifiedSplit(X_data, Y_data)
    # Note: the 'l1' penalty is only supported by the 'liblinear' and
    # 'saga' solvers; the other combinations will fail during the search.
    param_grid = {'C': [0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['l1', 'l2'],
                  'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
    model = LogisticRegression(random_state=0)
    clf = GridSearchCV(model, param_grid, cv=StratifiedKFold(n_splits=10))
    clf.fit(X_train, Y_train)
    print(accuracy_score(Y_train, clf.best_estimator_.predict(X_train)))
    print(accuracy_score(Y_test, clf.best_estimator_.predict(X_test)))


X_data, Y_data = getDataset("data - Sheet1.csv", ['distance'], 'orders')

run(X_data, Y_data)


Despite trying all these different algorithms, the accuracy doesn't exceed 36%!!

If you want to make a person recognize/classify another person by their T-shirt color, you cannot say: hey, if it's red that means he's John, and if it's red it's Peter, but if it's red it's Aisling!! He would say "really, what the heck is the difference?!"

And that's exactly what's in your dataset!

Simply run print(len(np.unique(X_data))) and print(len(np.unique(Y_data))) and you'll find that the numbers are very odd; in a nutshell, you have:

Number of Cases: 10000 !!
Number of Classes: 118 !!
Number of Unique Inputs (i.e. Features): 66 !!

All classes share a hell of a lot of information, which makes it impressive to get even up to 36% accuracy!

In other words, you have non-informative features, which leads to a lack of uniqueness in the model for each class!

What to do? I believe you are not allowed to remove some classes, so the only two solutions you have are:

  1. Either live with this very valid result.
  2. Or add more informative features.


Update

Now that you have provided the same dataset but with more features (i.e. the complete set of features), the situation is different.

I recommend you do the following:

  1. Pre-process your dataset (i.e. prepare it by imputing missing values or deleting rows containing missing values, converting dates to some unique numeric values (example), etc.); see the sketch after this list.

  2. Check which features are most important to the Orders classes. You can achieve that by using Forests of Trees to evaluate the importance of features. Here is a complete and simple example of how to do that in Scikit-Learn (also illustrated in the sketch after this list).

  3. Create a new version of the dataset, but this time hold Orders as the Y response and the above-found features as the X variables.

  4. Follow the same GridSearchCV and StratifiedKFold procedure that I showed you in the implementation above.
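
As a rough illustration of steps 1 and 2, here is a minimal sketch. The file name full_features.csv and the column names date and orders are hypothetical placeholders (they are not from the original dataset), and SimpleImputer plus ExtraTreesClassifier are just one reasonable choice each for imputation and feature ranking:

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.impute import SimpleImputer

# Hypothetical file and column names, used only for illustration.
df = pd.read_csv("full_features.csv")

# Step 1: pre-processing.
# Convert the date column to unique numeric values (ordinal day numbers).
df["date"] = pd.to_datetime(df["date"]).map(pd.Timestamp.toordinal)

# Impute missing values in the feature columns with the column mean
# (deleting the incomplete rows with df.dropna() is the alternative).
feature_cols = [c for c in df.columns if c != "orders"]
X = SimpleImputer(strategy="mean").fit_transform(df[feature_cols])
Y = df["orders"].values

# Step 2: rank the features with a forest of trees.
forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
forest.fit(X, Y)
for name, importance in sorted(zip(feature_cols, forest.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print(name, importance)

The features at the top of this ranking would then become the X variables of step 3.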


Hint

As mentioned by Vivek Kumar in the comments below, the stratify parameter has been added to the train_test_split function in a Scikit-learn update.

It works by passing the array-like ground truth, so you don't need my workaround in the function stratifiedSplit(X, Y) above.
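
For example, a minimal sketch (assuming X and Y are the feature matrix and labels, e.g. the X_data and Y_data returned by getDataset above):

from sklearn.model_selection import train_test_split

# stratify=Y preserves the class proportions in both splits,
# replacing the StratifiedShuffleSplit workaround above.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0, stratify=Y)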
