How to get Spark MLlib RandomForestModel.predict response as text value YES/NO?


Question

Hi, I am trying to implement the RandomForest algorithm using Apache Spark MLlib. I have a dataset in CSV format with the following features:

DayOfWeek(int),AlertType(String),Application(String),Router(String),Symptom(String),Action(String)
0,Network1,App1,Router1,Not reachable,YES
0,Network1,App2,Router5,Not reachable,NO

I want to use the MLlib RandomForest to predict the last field, Action, and I want the response as YES/NO.

I am following code from GitHub to create the RandomForest model. Since all of my features are categorical except for one int feature, I used the following code to convert them into a JavaRDD<LabeledPoint>. Please let me know if it is wrong.

// Load and parse the data file.
JavaRDD<String> data = jsc.textFile("/tmp/xyz/data/training-dataset.csv");

// I have 14 features so giving 14 as arg to the following
final HashingTF tf = new HashingTF(14);

// Create LabeledPoint datasets for Actionable and nonactionable
JavaRDD<LabeledPoint> labledData = data.map(new Function<String, LabeledPoint>() {
    @Override
    public LabeledPoint call(String alert) {
        List<String> featureList = Arrays.asList(alert.trim().split(","));
        String actionType = featureList.get(featureList.size() - 1).toLowerCase();
        return new LabeledPoint(actionType.equals("YES") ? 1 : 0, tf.transform(featureList));
    }
});

Similarly, I create testData and use it in the following code to make predictions:

JavaPairRDD<Double, Double> predictionAndLabel =
        testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
          @Override
          public Tuple2<Double, Double> call(LabeledPoint p) {
            return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
          }
        });

How do I get a prediction based on my last field, Action, with the prediction returned as YES/NO? The current predict method returns a double, and I do not understand how to implement this. Also, am I following the correct approach for converting categorical features into a LabeledPoint? Please guide me; I am new to machine learning and Spark MLlib.

Recommended answer

I am more familiar with the Scala version, but I'll try to help.

You need to map the target variable (Action) and every categorical feature to integer levels starting at 0, i.e. 0, 1, 2, 3, and so on. For example, router1, router2, ..., router5 become 0, 1, 2, ..., 4. The same applies to your target variable, which I think is the only one you actually mapped (YES/NO to 1/0). I am not sure what your tf.transform(featureList) is actually doing.
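The mapping described above can be sketched in plain Java. This is a minimal illustration, not Spark API: `CategoryIndexer`, `indexOf`, `levels`, and `labelOf` are hypothetical names introduced here. It assumes the set of distinct values per column is small enough to index in one place (on a cluster you would likely build the map on the driver first, for example from the distinct values of each column). Note also that the question's code lower-cases the label and then compares it with "YES", which can never match, so every row ends up labeled 0; a case-insensitive comparison avoids that.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Spark): assigns 0-based indices to category strings.
final class CategoryIndexer {
    private final Map<String, Integer> indexByValue = new LinkedHashMap<>();

    // Returns the index of a value, assigning the next free index the first time it is seen.
    int indexOf(String value) {
        Integer existing = indexByValue.get(value);
        if (existing != null) {
            return existing;
        }
        int next = indexByValue.size();
        indexByValue.put(value, next);
        return next;
    }

    // Number of distinct values seen so far, i.e. the number of levels of this feature.
    int levels() {
        return indexByValue.size();
    }

    // Maps the Action column to a numeric label: YES -> 1.0, anything else -> 0.0.
    static double labelOf(String action) {
        return "YES".equalsIgnoreCase(action.trim()) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        CategoryIndexer routers = new CategoryIndexer();
        List<String> column = Arrays.asList("Router1", "Router5", "Router1");
        for (String router : column) {
            System.out.println(router + " -> " + routers.indexOf(router));
        }
        System.out.println("levels = " + routers.levels());
        System.out.println("label(YES) = " + labelOf("YES"));
    }
}
```

One indexer per categorical column keeps the levels of different features independent, and `levels()` gives the count needed later when describing the categorical features to the trainer.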

Once you have done this, you can train your RandomForest classifier, specifying the map of categorical features. Basically, you need to tell it which features are categorical and how many levels each of them has. This is the Scala version, but you can easily translate it into Java:

val categoricalFeaturesInfo = Map[Int, Int]((2,2),(3,5))

This says that, in your list of features, the third one (index 2) has 2 levels, written (2,2), and the fourth one (index 3) has 5 levels, written (3,5). The remaining features are treated as continuous (Double) values.
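For a Java caller, the same information can be expressed with a `java.util.Map<Integer, Integer>`. The sketch below only builds the map, mirroring the Scala values (2,2) and (3,5); the class name `CategoricalInfoExample` is hypothetical. As far as I know, the Java-friendly overload of `RandomForest.trainClassifier` accepts such a map directly and takes an additional seed argument, but check the API documentation for your Spark version.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example class: builds the feature-index -> number-of-levels map.
final class CategoricalInfoExample {

    // Mirrors the Scala Map[Int, Int]((2,2),(3,5)) from the answer.
    static Map<Integer, Integer> build() {
        Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
        categoricalFeaturesInfo.put(2, 2); // feature at index 2 has 2 levels
        categoricalFeaturesInfo.put(3, 5); // feature at index 3 has 5 levels
        return categoricalFeaturesInfo;
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```

The keys are 0-based positions in the feature vector, and any feature not listed in the map is treated as continuous.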

Now you pass categoricalFeaturesInfo, together with the other parameters, when training the classifier:

val modelRF = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

Now, when you need to evaluate it, the predict function returns a double, 0.0 or 1.0, and you can use that to compute accuracy, precision, or any other metric you need.
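To get the YES/NO text the question asks for, the numeric prediction can be mapped back with the inverse of the label encoding. This is a minimal sketch assuming YES was encoded as 1.0 and NO as 0.0; `ActionLabels` and `toText` are hypothetical names, not part of Spark.

```java
// Hypothetical helper: converts the numeric class returned by predict back to text.
final class ActionLabels {

    // Assumes the training labels were encoded as YES -> 1.0 and NO -> 0.0.
    static String toText(double prediction) {
        return prediction == 1.0 ? "YES" : "NO";
    }

    public static void main(String[] args) {
        // In the question's code this would wrap model.predict(p.features()).
        System.out.println(toText(1.0));
        System.out.println(toText(0.0));
    }
}
```

If the target had more than two classes, the same idea would apply with a lookup from class index to the original string.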

Here is an example (sorry, Scala again), assuming you have a testData set to which you applied the same transformations as before:

val predictionAndLabels = testData.map { point =>
  val prediction = modelRF.predict(point.features)
  (point.label, prediction)
} 

Here the results are clear: the label is 1/0 and the predicted value is also 1/0, so computing accuracy, precision, and recall is straightforward.
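As an illustration of that last step, the following plain-Java sketch computes the three metrics from (label, prediction) pairs in the same order as the Scala snippet. It assumes the pairs have been collected into a local array; `BinaryMetrics` is a hypothetical class, and returning 0.0 when a denominator is zero is a choice made here, not something the original answer specifies.

```java
// Hypothetical helper: accuracy, precision and recall for 0/1 labels and predictions.
final class BinaryMetrics {
    final long tp; // label 1, predicted 1
    final long fp; // label 0, predicted 1
    final long tn; // label 0, predicted 0
    final long fn; // label 1, predicted 0

    // Each row is {label, prediction}, matching (point.label, prediction).
    BinaryMetrics(double[][] labelAndPrediction) {
        long truePos = 0, falsePos = 0, trueNeg = 0, falseNeg = 0;
        for (double[] pair : labelAndPrediction) {
            boolean actual = pair[0] == 1.0;
            boolean predicted = pair[1] == 1.0;
            if (actual && predicted) truePos++;
            else if (!actual && predicted) falsePos++;
            else if (!actual) trueNeg++;
            else falseNeg++;
        }
        tp = truePos;
        fp = falsePos;
        tn = trueNeg;
        fn = falseNeg;
    }

    double accuracy() {
        long total = tp + fp + tn + fn;
        return total == 0 ? 0.0 : (double) (tp + tn) / total;
    }

    double precision() {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    double recall() {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    public static void main(String[] args) {
        double[][] pairs = {{1, 1}, {0, 0}, {1, 0}, {0, 1}, {1, 1}};
        BinaryMetrics m = new BinaryMetrics(pairs);
        System.out.println("accuracy  = " + m.accuracy());
        System.out.println("precision = " + m.precision());
        System.out.println("recall    = " + m.recall());
    }
}
```

For large test sets it is probably better to compute the counts on the cluster rather than collecting every pair to the driver.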

Hope this helps!
