How to get Spark MLlib RandomForestModel.predict response as text value YES/NO?


Problem description

Hi, I am trying to use the RandomForest algorithm from Apache Spark MLlib. My dataset is in CSV format with the following columns:

DayOfWeek(int),AlertType(String),Application(String),Router(String),Symptom(String),Action(String)
0,Network1,App1,Router1,Not reachable,YES
0,Network1,App2,Router5,Not reachable,NO

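I want to train an MLlib RandomForest model to predict the last field, Action, and I want the prediction returned as YES/NO.

I am following the JavaRandomForestExample code from GitHub (https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/mllib/JavaRandomForestExample.java) to create the RandomForest model. Since all my features are categorical except one int feature, I use the following code to convert them into a JavaRDD<LabeledPoint>. Please let me know in case it is wrong.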

        // Load and parse the data file.
        JavaRDD<String> data = jsc.textFile("/tmp/xyz/data/training-dataset.csv");

        // I have 14 features so giving 14 as arg to the following
        final HashingTF tf = new HashingTF(14);

        // Create LabeledPoint datasets for Actionable and nonactionable
        JavaRDD<LabeledPoint> labledData = data.map(new Function<String, LabeledPoint>() {
            @Override public LabeledPoint call(String alert) {
                List<String> featureList = Arrays.asList(alert.trim().split(","));
                // Compare case-insensitively: lowercasing and then testing against "YES" never matches.
                String actionType = featureList.get(featureList.size() - 1).trim();
                return new LabeledPoint(actionType.equalsIgnoreCase("YES") ? 1 : 0, tf.transform(featureList));
            }
        });

Similarly, I create testData and use it in the following code to make predictions:

JavaPairRDD<Double, Double> predictionAndLabel =
        testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
          @Override
          public Tuple2<Double, Double> call(LabeledPoint p) {
            return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
          }
        });

How do I get a prediction for my last field, Action, as YES/NO? The current predict method returns a double, and I do not understand how to turn that into text. Also, am I following the correct approach for converting categorical features into LabeledPoint? Please guide me; I am new to machine learning and Spark MLlib.

Solution

I am more familiar with the Scala version, but I'll try to help.

You need to map the target variable (Action) and every categorical feature to integer levels starting at 0, i.e. 0, 1, 2, 3, ... For example, Router1, Router2, ..., Router5 become 0, 1, 2, ..., 4. The same goes for your target variable, which I think is the only one you actually mapped, from YES/NO to 1/0 (I am not sure what your tf.transform(featureList) is actually doing).
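To make the level mapping concrete, here is a minimal plain-Java sketch. CategoryIndexer is a hypothetical helper, not part of Spark, and it assumes you build the mapping on the driver (for example from the collected distinct values of a column) before using it to encode rows; a mutable map filled inside an RDD closure would not be shared across executors.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Spark class): assigns each distinct category
// value an index 0, 1, 2, ... in order of first appearance.
class CategoryIndexer {
    private final Map<String, Integer> indexByValue = new LinkedHashMap<>();

    // Returns the index for a value, registering the value on first sight.
    int indexOf(String value) {
        Integer idx = indexByValue.get(value);
        if (idx == null) {
            idx = indexByValue.size();
            indexByValue.put(value, idx);
        }
        return idx;
    }

    // Number of distinct levels seen so far, i.e. the level count
    // you would later declare for this feature.
    int numLevels() {
        return indexByValue.size();
    }
}
```

You would keep one indexer per categorical column (AlertType, Application, Router, Symptom) and put the resulting indices, as doubles, into the feature vector.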

Once you have done this, you can train your RandomForest classifier and specify the map of categorical features. Basically, you have to tell it which features are categorical and how many levels each one has. This is the Scala version, but you can easily translate it into Java:

val categoricalFeaturesInfo = Map[Int, Int]((2,2),(3,5))

This says that, in your list of features, the third one (index 2) has 2 levels, written (2,2), and the fourth one (index 3) has 5 levels, written (3,5). The remaining features are treated as continuous (Double) values.
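For the Java side, the same information can be held in a java.util.Map<Integer, Integer>, which is the type the Java overload of RandomForest.trainClassifier accepts in the JavaRandomForestExample linked in the question. A small sketch follows; the class and method names are only illustrative, and the indices and level counts are the example values from above, not values derived from your data.

```java
import java.util.HashMap;
import java.util.Map;

class CategoricalInfoExample {
    // Java counterpart of the Scala Map[Int, Int]((2,2),(3,5)).
    static Map<Integer, Integer> buildCategoricalFeaturesInfo() {
        Map<Integer, Integer> info = new HashMap<>();
        info.put(2, 2); // feature index 2 (third feature) has 2 levels
        info.put(3, 5); // feature index 3 (fourth feature) has 5 levels
        return info;
    }
}
```

Any feature index left out of the map is treated as continuous.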

Now pass categoricalFeaturesInfo, together with the other parameters, when training the classifier:

val modelRF = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

Now, when you need to evaluate it, the predict function returns a Double, either 0.0 or 1.0, and you can use that to compute accuracy, precision, or any other metric you need.

Here is an example of the evaluation step (sorry, Scala again), assuming a testData that went through the same transformations as the training data:
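Since the model only ever returns the numeric class, getting YES/NO back is just the inverse of the mapping used for the label. A minimal sketch, assuming YES was encoded as 1.0 and NO as 0.0 as in the question (ActionLabel is a hypothetical helper name):

```java
class ActionLabel {
    // Encodes the Action column: YES -> 1.0, anything else -> 0.0.
    static double encode(String action) {
        return "YES".equalsIgnoreCase(action.trim()) ? 1.0 : 0.0;
    }

    // Decodes a predicted class back to the text used in the dataset.
    static String decode(double prediction) {
        return prediction == 1.0 ? "YES" : "NO";
    }
}
```

In the question's code this would look like ActionLabel.decode(model.predict(p.features())) wherever the text value is needed; keep the numeric value for computing metrics.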

val predictionAndLabels = testData.map { point =>
  val prediction = modelRF.predict(point.features)
  (point.label, prediction)
} 

Here the results are easy to read: the label is 1/0 and the predicted value is also 1/0, so computing accuracy, precision, and recall is straightforward.
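As a rough illustration of that last step, here is a plain-Java sketch that computes the three metrics from (label, prediction) pairs that have already been collected to the driver. BinaryMetrics is a hypothetical class written for this answer, not a Spark API; Spark also ships its own evaluation classes under org.apache.spark.mllib.evaluation, which may be a better fit for large test sets.

```java
import java.util.List;

class BinaryMetrics {
    private final long tp, fp, tn, fn;

    // Each element is {label, prediction}, both expected to be 0.0 or 1.0.
    BinaryMetrics(List<double[]> labelAndPrediction) {
        long tp = 0, fp = 0, tn = 0, fn = 0;
        for (double[] pair : labelAndPrediction) {
            boolean actual = pair[0] == 1.0;
            boolean predicted = pair[1] == 1.0;
            if (predicted && actual) tp++;
            else if (predicted) fp++;
            else if (actual) fn++;
            else tn++;
        }
        this.tp = tp;
        this.fp = fp;
        this.tn = tn;
        this.fn = fn;
    }

    // Each ratio is NaN when its denominator is zero.
    double accuracy()  { return (double) (tp + tn) / (tp + fp + tn + fn); }
    double precision() { return (double) tp / (tp + fp); }
    double recall()    { return (double) tp / (tp + fn); }
}
```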

I hope it helps!!
