Newbie: where to start given a problem of predicting future success or failure


Problem description

We have a production web-based product that allows users to make predictions about the future value (or demand) of goods. The historical data contains about 100k examples, and each example has about 5 parameters.

Consider a class of data called a prediction:

prediction {
    id: int
    predictor: int    
    predictionDate: date
    predictedProductId: int
    predictedDirection: byte  (0 for decrease, 1 for increase)
    valueAtPrediciton: float
}

and a paired result class that measures the outcome of the prediction:

predictionResult {
    id: int
    valueTenDaysAfterPrediction: float
    valueTwentyDaysAfterPrediction: float
    valueThirtyDaysAfterPrediction: float
}

We can define a test for success: a prediction succeeds if at least two of the three future value checkpoints are favorable, given the predicted direction and the value at the time of prediction.

success(p: prediction, r: predictionResult): bool = 
    count: int 
    count = 0

    // value is predicted to fall
    if p.predictedDirection = 0 then
       if p.valueAtPrediciton > r.valueTenDaysAfterPrediction then count = count + 1
       if p.valueAtPrediciton > r.valueTwentyDaysAfterPrediction then count = count + 1
       if p.valueAtPrediciton > r.valueThirtyDaysAfterPrediction then count = count + 1

    // value is predicted to increase
    else
       if p.valueAtPrediciton < r.valueTenDaysAfterPrediction then count = count + 1
       if p.valueAtPrediciton < r.valueTwentyDaysAfterPrediction then count = count + 1
       if p.valueAtPrediciton < r.valueThirtyDaysAfterPrediction then count = count + 1

    // success if count = 2 or count = 3
    return (count > 1)
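
For readers who want something executable, the pseudocode above could be sketched in Python roughly as follows. The snake_case field names and the use of dataclasses are assumptions made for illustration; only the success rule itself comes from the question.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Prediction:
    id: int
    predictor: int
    prediction_date: date
    predicted_product_id: int
    predicted_direction: int  # 0 for decrease, 1 for increase
    value_at_prediction: float


@dataclass
class PredictionResult:
    id: int
    value_ten_days_after: float
    value_twenty_days_after: float
    value_thirty_days_after: float


def success(p: Prediction, r: PredictionResult) -> bool:
    """Return True when at least two of the three checkpoints are favorable."""
    checkpoints = (
        r.value_ten_days_after,
        r.value_twenty_days_after,
        r.value_thirty_days_after,
    )
    if p.predicted_direction == 0:
        # Value was predicted to fall: favorable when the later value is lower.
        favorable = sum(p.value_at_prediction > v for v in checkpoints)
    else:
        # Value was predicted to rise: favorable when the later value is higher.
        favorable = sum(p.value_at_prediction < v for v in checkpoints)
    return favorable >= 2
```

Applying success to every historical (prediction, result) pair yields the Y/N label that a classifier would later be trained to predict.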

Everything in the prediction class is known the moment the user submits the form, while the information in predictionResult is not known until later. Ideally, a model or algorithm can be derived from our three-year history so that, when it is applied to a new prediction, we get a probability of whether it will be a success or not (I would be happy with a boolean Y/N flag as to whether the prediction is interesting or not).

I don't know much about machine learning, and I am trying to make my way through the material. But it would be great if I could have some guidance so I can research and practice exactly what I need to solve a problem like this.

Thanks

Recommended answer

Features

The first thing you'll need to do is decide what information you'll use as evidence to classify a user's prediction as accurate or not. For example, you could start with simple things like the identity of the user making the prediction and their historical accuracy when making predictions on the same or similar goods. This information will be provided to downstream machine learning tools as features that will be used to classify the users' predictions.
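
As a rough illustration of the "historical accuracy" feature, the sketch below walks through past predictions in date order and records, for each one, how the same user had done up to that point. The function name, the tuple layout, and the 0.5 default for users with no history are all assumptions. It also simplifies one thing: in the real system an outcome is only known thirty days after the prediction, so a stricter version would count only predictions that had already been resolved.

```python
from collections import defaultdict


def add_history_features(rows):
    """Build one feature dict per row.

    rows: iterable of (predictor_id, product_id, succeeded) tuples,
    already sorted by prediction date (hypothetical layout).
    """
    totals = defaultdict(int)  # predictions seen so far, per user
    wins = defaultdict(int)    # successful predictions so far, per user
    features = []
    for predictor_id, product_id, succeeded in rows:
        seen = totals[predictor_id]
        features.append({
            "predictor_id": predictor_id,
            "product_id": product_id,
            "prior_predictions": seen,
            # 0.5 is an arbitrary neutral default for users with no history.
            "prior_accuracy": wins[predictor_id] / seen if seen else 0.5,
        })
        # Update the running counts only after emitting the row's features,
        # so a row never sees its own outcome.
        totals[predictor_id] += 1
        wins[predictor_id] += int(succeeded)
    return features
```

Computing the feature from earlier rows only keeps information about a prediction's own outcome from leaking into its features.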

Training, development, and test data

You'll want to split your 100k historical examples into three parts: training, development, and test. You should put most of the data, say 80% of it, in your training set. This will be the dataset you use to train your prediction accuracy classifier. Generally speaking, the more data you use to train your classifier, the more accurate the resulting model will be.

The two other data sets, development and test, will be used to evaluate the performance of your classifier. You'll use the development set to evaluate the accuracy of different configurations of your classifier or variations in the feature representation. It's called the development set because you use it to continuously evaluate classification performance as you develop your model or system.
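
To make the role of the development set concrete, here is a minimal sketch that scores several candidate classifiers on it and keeps the best one. The helper names and the idea of representing a classifier as a plain function from a feature dict to a label are assumptions for illustration.

```python
def accuracy(classify, examples):
    """Fraction of (features, label) pairs that classify() labels correctly."""
    correct = sum(classify(features) == label for features, label in examples)
    return correct / len(examples)


def pick_best(candidates, dev_set):
    """candidates: dict mapping a name to a classify function.

    Returns the (name, dev accuracy) of the highest-scoring candidate.
    """
    scores = {name: accuracy(fn, dev_set) for name, fn in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy development set and two threshold rules on a hypothetical feature.
dev_set = [
    ({"prior_accuracy": 0.9}, True),
    ({"prior_accuracy": 0.6}, True),
    ({"prior_accuracy": 0.4}, False),
    ({"prior_accuracy": 0.7}, False),
]
candidates = {
    "threshold_0.50": lambda f: f["prior_accuracy"] > 0.50,
    "threshold_0.65": lambda f: f["prior_accuracy"] > 0.65,
}
best_name, best_score = pick_best(candidates, dev_set)
```

In practice the candidates would be trained models with different settings or feature sets, but the comparison loop has the same shape.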

Later, after you've built a model that achieves good performance on the development data, you'll probably want an unbiased estimate of how well your classifier will perform on new data. For this you'll use the test set, which evaluates how well the classifier does on data other than what you used to develop it.
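
The three-way split described in this section could be sketched as below. The 80/10/10 proportions follow the "say 80%" suggestion for training; dividing the remainder evenly between development and test, the function name, and the fixed seed are assumptions.

```python
import random


def split_dataset(examples, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle once, then cut into training, development, and test parts."""
    shuffled = list(examples)
    # A fixed seed makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains
    return train, dev, test


train, dev, test = split_dataset(range(1000))
```

The test part should be set aside and evaluated only once the model and features are settled, so that its score stays an unbiased estimate.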

Classifiers / ML packages

After you have your preliminary feature set and you've split the data into training, development, and test sets, you're ready to choose a machine learning package and classifier. A few good packages that support numerous types of classifiers include:

  • Weka (Java)
  • Rapid Miner (Java)
  • Orange (Python)

Which classifier you should use depends on many factors, including what kind of predictions you'd like to make (e.g., binary, multiclass), what kinds of features you'd like to use, and the amount of training data you want to use.

For example, if you just want to make a binary classification of whether a user's prediction is probably accurate or not, you might want to try support vector machines (SVMs). Their basic formulation is limited to making binary predictions. But if that is all you need, they are often a good choice since they can result in very accurate models.

However, the time required to train an SVM scales poorly with the size of the training data. To train on substantial amounts of data, you might decide to use something like random forests. When random forests and SVMs are trained on data sets of the same size, random forests will typically produce a model that is either as accurate or nearly as accurate as an SVM model. However, random forests can allow you to use more training data, and using more training data will typically increase the accuracy of your model.
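
As one possible starting point, the sketch below trains a random forest and asks it for a probability of success on a new prediction. It uses scikit-learn, which is not one of the packages listed above and is swapped in here only because it keeps the example short. The feature columns and the toy rows are invented for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy feature rows: [prior_accuracy, prior_predictions, predicted_direction].
X_train = [
    [0.90, 40, 1], [0.80, 25, 0], [0.85, 60, 1], [0.75, 30, 0],
    [0.20, 35, 1], [0.30, 20, 0], [0.25, 50, 1], [0.35, 45, 0],
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = the prediction succeeded

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# predict_proba returns one column per class, ordered as in model.classes_.
success_column = list(model.classes_).index(1)
new_prediction = [[0.80, 40, 1]]
p_success = model.predict_proba(new_prediction)[0][success_column]
print(f"estimated probability of success: {p_success:.2f}")
```

Thresholding p_success (for example at 0.5) gives the boolean Y/N flag mentioned in the question, and the threshold itself can be tuned on the development set.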

Digging deeper

Here are a few pointers to other good places to get started with machine learning:

  • Video Lectures from Andrew Ng's machine learning course at Stanford
  • Andrew Moore's machine learning tutorials
  • Hastie's The Elements of Statistical Learning - Hastie has posted a PDF of the book here.
