Linking the Machine Learning Prediction back to the original data set

Problem Description

I am in the process of doing a POC on retail transaction data using a few machine learning algorithms, and coming up with a prediction model for out-of-stock analysis. My questions might sound stupid, but I would really appreciate it if you or anyone else could answer me.

So far I have been able to get a data set ==> convert the features into LabeledPoints (label, feature vector) ==> train an ML model ==> run the model on a test data set ==> and get the predictions.

Since I have no experience with any of the Java/Python/Scala languages, I am building my features in the database and saving that data as a CSV file for my machine learning algorithm.

How do we create features from raw data using Scala?

The source data set consists of many features for each (Store, Product, Date) set, and their recorded OOS events (the target):

StoreID (text column), ProductID (text column), TranDate, (Label/Target), Feature1, Feature2, ..., FeatureN

Since the features can only contain numeric values, I just create features out of the numeric columns and not the text ones (which are the natural key for me). When I run the model on a validation set, I get a (Prediction, Label) array back.

Now how do I link this resulting set back to the original data set and see which specific (Store, Product, Date) combinations might have a possible out-of-stock event?

I hope the problem statement was clear enough.

MJ

Answer

Spark's Linear Regression Example

Here's a snippet from the Spark docs' Linear Regression example that is fairly instructive and easy to follow.

It solves both your "Problem 1" and "Problem 2".

It doesn't need a JOIN and doesn't even rely on RDD order.

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")

Here, data is an RDD of text lines.

val parsedData = data.map { line =>
  val parts = line.split(',')   // label, then space-separated features
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

Problem 1: Parsing the Features

This is data dependent. Here we see that lines are being split on ',' into fields. It appears this data was a CSV of entirely numeric data.

The first field is treated as the label of a labeled point (the dependent variable), and the rest of the fields are converted from text to Double (floating point) and packed into a vector. This vector holds the features, or independent variables.

In your own project, the part you need to remember is the goal of parsing into an RDD of LabeledPoints, where the first parameter of LabeledPoint, the label, is the true dependent numeric value, and the second parameter, the features, is a vector of numbers.
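As a minimal illustration (the values here are invented, not from the question's data), a single row with label 1.0 and three numeric features would be constructed like this:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Label (dependent variable) first, then a dense vector of features
val point = LabeledPoint(1.0, Vectors.dense(12.5, 3.0, 0.0))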

Getting the data into this shape requires knowing how to code. Python may be easiest for data parsing. You can always use other tools to create a purely numeric CSV, with the dependent variable in the first column and the numeric features in the other columns, and no header line -- and then duplicate the example parsing function.

import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Building the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

At this point we have a trained model object. The model object has a predict method (see https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/mllib/regression/RegressionModel.html) that operates on feature vectors and returns estimates of the dependent variable.
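As a quick sketch (the feature values are made up for illustration), calling predict on a single feature vector looks like:

// Returns the model's estimate of the dependent variable for one row
val estimate: Double = model.predict(Vectors.dense(1.0, 0.5, 3.2))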

ML routines typically want numeric feature vectors, but you can often translate free text or categorical features (color, size, brand name) into numeric vectors in some space. There are a variety of ways to do this, such as Bag-of-Words for text, or One-Hot Encoding for categorical data, where you code a 1.0 or 0.0 for membership in each possible category (watch out for multicollinearity, though). These methodologies can create large feature vectors, which is why there are iterative methods available in Spark for training models. Spark also has a SparseVector() class, where you can easily create vectors with all but certain feature dimensions set to 0.0.
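For instance, a minimal one-hot encoding sketch using Vectors.sparse -- the category vocabulary here is hypothetical, not from the original data:

// Hypothetical vocabulary for a categorical "color" feature
val colors = Array("red", "green", "blue")

// One-hot encode: a sparse vector with a single 1.0 at the category's index
// (assumes the value is actually present in the vocabulary)
def oneHot(color: String) = {
  val idx = colors.indexOf(color)
  Vectors.sparse(colors.length, Array(idx), Array(1.0))
}

oneHot("green")  // (3,[1],[1.0]), i.e. [0.0, 1.0, 0.0]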

Next they test this model with the training data, but the calls would be the same with external test data, provided that the test data is an RDD of LabeledPoint(dependent value, Vector(features)). The input could be changed by changing the variable parsedData to some other RDD.

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

Notice that this returns tuples of the true dependent variable, previously stored in point.label, and the model's prediction from point.features, for each row or LabeledPoint.

Now we are ready to compute the Mean Squared Error, since the valuesAndPreds RDD contains tuples (v, p) of the true value v and the prediction p, both of type Double.

The MSE is a single number: first the tuples are mapped to an RDD of squared distances ||v - p||**2 individually, and then averaged, yielding a single number.

val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()

println("training Mean Squared Error = " + MSE)

Spark's Logistic Regression Example

This is similar, but here you can see the data has already been parsed and split into training and test sets.

// Split data into training (60%) and test (40%).
// (`data` here is an RDD[LabeledPoint], already parsed.)
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

Here the model is trained against the training set.

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// Run training algorithm to build the model
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(training)

And tested (compared) against the test set. Notice that even though this is a different model (logistic instead of linear), there is still a model.predict method that takes a point's feature vector as a parameter and returns the prediction for that point.

Once again the prediction is paired with the true value, from the label, in a tuple for comparison in a performance metric.

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Compute raw scores on the test set.
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}

// Get evaluation metrics.
val metrics = new MulticlassMetrics(predictionAndLabels)
val precision = metrics.precision
println("Precision = " + precision)

What about JOIN? RDD.join comes in if you have two RDDs of (key, value) pairs and need an RDD corresponding to the intersection of keys, with both values. But we didn't need that here.
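To tie this back to your (Store, Product, Date) question: since predict works row by row, you can simply carry the natural key along with each feature vector instead of joining afterwards. A minimal sketch, assuming a hypothetical rawData RDD of CSV lines laid out as StoreID,ProductID,TranDate,Label,Feature1,...,FeatureN:

// Keep the natural key paired with each prediction -- no JOIN needed
val keyedPredictions = rawData.map { line =>
  val parts = line.split(',')
  val key = (parts(0), parts(1), parts(2))  // (StoreID, ProductID, TranDate)
  val features = Vectors.dense(parts.drop(4).map(_.toDouble))
  (key, model.predict(features))
}

// e.g. flag rows the model scores as likely Out-Of-Stock events
// (the 0.5 threshold is purely illustrative)
val likelyOOS = keyedPredictions.filter { case (_, p) => p > 0.5 }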
