XGBoost-4j by DMLC on Spark-1.6.1


Problem Description


I am trying to use the XGBoost implementation by DMLC on Spark-1.6.1. I am able to train my data with XGBoost but am facing difficulties with prediction. I actually want to do prediction the way it can be done in the Apache Spark MLlib libraries, which helps with calculating training error, precision, recall, specificity, and so on.
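For the record, once I have an RDD of (prediction, label) pairs, I know MLlib's MulticlassMetrics gives most of these numbers; a minimal sketch of what I am ultimately after (labelAndPreds here is a hypothetical RDD[(Double, Double)]):

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// labelAndPreds: hypothetical RDD[(Double, Double)] of (prediction, label) pairs
val metrics = new MulticlassMetrics(labelAndPreds)
println(s"Confusion matrix:\n${metrics.confusionMatrix}")
println(s"Precision of class 1: ${metrics.precision(1.0)}")
println(s"Recall of class 1: ${metrics.recall(1.0)}")
// in a binary problem, specificity of class 1 is the recall of class 0
println(s"Specificity of class 1: ${metrics.recall(0.0)}")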

I am posting the code below, along with the error I am getting. I used xgboost4j-spark-0.5-jar-with-dependencies.jar when starting spark-shell.

import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import ml.dmlc.xgboost4j.scala.{Booster, DMatrix}
import ml.dmlc.xgboost4j.scala.spark.{DataUtils, XGBoost}

//Load and parse the data file.
val data = sc.textFile("file:///home/partha/credit_approval_2_attr.csv")
val data1 = sc.textFile("file:///home/partha/credit_app_fea.csv")


val parsedData = data.map { line =>
    val parts = line.split(',').map(_.toDouble)
    LabeledPoint(parts(0), Vectors.dense(parts.tail))
}.cache()

val parsedData1 = data1.map { line =>
    val parts = line.split(',').map(_.toDouble)
    Vectors.dense(parts)
}



//Tuning Parameters
val paramMap = List(
  "eta" -> 0.1f,
  "max_depth" -> 5,
  "num_class" -> 2,
  "objective" -> "multi:softmax",
  "colsample_bytree" -> 0.8,
  "alpha" -> 1,
  "subsample" -> 0.5).toMap

//Training the model
val numRound = 20
val model = XGBoost.train(parsedData, paramMap, numRound, nWorkers = 1)
val pred = model.predict(parsedData1)
pred.collect()

Output from pred:

res0: Array[Array[Array[Float]]] = Array(Array(Array(0.0), Array(1.0), Array(1.0), Array(1.0), Array(0.0), Array(0.0), Array(1.0), Array(1.0), Array(0.0), Array(1.0), Array(0.0), Array(0.0), Array(0.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(0.0), Array(1.0), Array(1.0), Array(0.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(0.0), Array(1.0), Array(1.0), Array(1.0), Array(0.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(0.0), Array(0.0), Array(0.0), Array(0.0), Array(1.0), Array(0.0), Array(0.0), Array(0.0), Array(0.0), Array(0.0), Array(0.0), Array(1.0), Array(1.0), Array(1.0), Array(...
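
If I read the shape right, the outer array is one entry per partition, the middle one entry per row, and the inner the per-row output, so flattening it down to one predicted class per row would look like this (a sketch, assuming that nesting):

// pred: RDD[Array[Array[Float]]] -- flatten (partition, row, output) to one Float per row
val predictedClasses = pred.flatMap(partitionPreds => partitionPreds.map(_.head))
predictedClasses.collect()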

Now when I am using:

val labelAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

Output:

<console>:66: error: overloaded method value predict with alternatives:
  (testSet: ml.dmlc.xgboost4j.scala.DMatrix)Array[Array[Float]] <and>
  (testSet: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector])org.apache.spark.rdd.RDD[Array[Array[Float]]]
 cannot be applied to (org.apache.spark.mllib.linalg.Vector)
                  val prediction = model.predict(point.features)
                                     ^

Then I tried this, since predict requires an RDD[Vector]:

val labelAndPreds1 = parsedData.map { point =>
  val prediction = model.predict(Vectors.dense(point.features))
  (point.label, prediction)
}

The outcome was:

<console>:66: error: overloaded method value dense with alternatives:
  (values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
  (firstValue: Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
 cannot be applied to (org.apache.spark.mllib.linalg.Vector)
                  val prediction = model.predict(Vectors.dense(point.features))
                                                         ^

Clearly it's an issue of RDD types that I am trying to sort out; this is easy with GBTs on Spark (http://spark.apache.org/docs/latest/mllib-ensembles.html#gradient-boosted-trees-gbts).
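
One thing I have been considering is calling predict on the whole RDD of features and zipping the result back onto the labels, roughly like this (a sketch; it assumes the returned predictions preserve row order and per-partition counts, which I have not verified):

val features = parsedData.map(_.features)
// flatten the (partition, row, output) nesting down to one predicted class per row
val preds = model.predict(features).flatMap(_.map(_.head.toDouble))
// zip requires identical partitioning and element counts on both sides
val labelAndPreds = parsedData.map(_.label).zip(preds)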

Am I trying to do this the right way?

Any help or suggestion would be awesome.

Solution

Actually, this isn't available in the XGBoost algorithms. I'm facing the same problem here and have implemented the following method:

// thanks to @Z Simon
import org.apache.spark.rdd.RDD
import ml.dmlc.xgboost4j.scala.DMatrix
import ml.dmlc.xgboost4j.scala.spark.{DataUtils, XGBoostModel}
// XGBLabeledPoint is the labeled-point type from the original answer;
// adjust its import to wherever it lives in your xgboost4j version

def labelPredict(testSet: RDD[XGBLabeledPoint],
                 useExternalCache: Boolean = false,
                 booster: XGBoostModel): RDD[(Float, Float)] = {
  // ship the booster to the executors once instead of serializing it per task
  val broadcastBooster = testSet.sparkContext.broadcast(booster)
  testSet.mapPartitions { testData =>
    // duplicate the iterator: one pass builds the DMatrix, the other keeps the labels
    val (auxiliaryIterator, testDataIterator) = testData.duplicate
    val testDataArray = auxiliaryIterator.toArray
    val prediction = broadcastBooster.value.predict(new DMatrix(testDataIterator)).flatten
    testDataArray
      .zip(prediction)
      .map {
        case (labeledPoint, predictionValue) =>
          (labeledPoint.label, predictionValue)
      }.toIterator
  }
}

This is almost the same code that XGBoost actually uses internally, but it carries the label of each labeled point through into the returned predictions. When you pass an RDD of labeled points to this method, it returns an RDD of (label, prediction) tuples, one per input point.
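
A quick usage sketch, assuming testSet is whatever RDD[XGBLabeledPoint] you have built from your data and model is the trained XGBoostModel; accuracy (and hence training error) then falls out of the returned pairs:

val labelAndPreds = labelPredict(testSet, booster = model)
val correct = labelAndPreds.filter { case (label, prediction) => label == prediction }.count()
val accuracy = correct.toDouble / labelAndPreds.count()
println(s"Accuracy: $accuracy, training error: ${1.0 - accuracy}")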
