如何创建于火花ML分类正确的数据帧 [英] How to create correct data frame for classification in Spark ML

查看：893 发布时间：2016/5/22 15:28:17 scala apache-spark apache-spark-sql apache-spark-mllib

本文介绍了如何创建于火花ML分类正确的数据帧的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我试图用星火ML API 运行随机森林分类，但我有与创建正确的数据帧输入到管道的问题。

I am trying to run random forest classification by using Spark ML api but I am having issues with creating right data frame input into pipeline.

下面是样本数据：

age,hours_per_week,education,sex,salaryRange
38,40,"hs-grad","male","A"
28,40,"bachelors","female","A"
52,45,"hs-grad","male","B"
31,50,"masters","female","B"
42,40,"bachelors","male","B"

年龄和 hours_per_week 是整数，而其他的功能，包括标签的 salaryRange 是分类（字符串）

age and hours_per_week are integers while other features including label salaryRange are categorical (String)

加载该csv文件（可以称之为sample.csv）可以通过星火CSV库做这样的：

Loading this csv file (lets call it sample.csv) can be done by Spark csv library like this:

val data = sqlContext.csvFile("/home/dusan/sample.csv")

在默认情况下所有列导入为字符串，所以我们需要改变年龄和hours_per_week来诠释：

By default all columns are imported as string so we need to change "age" and "hours_per_week" to Int:

val toInt    = udf[Int, String]( _.toInt)
val dataFixed = data.withColumn("age", toInt(data("age"))).withColumn("hours_per_week",toInt(data("hours_per_week")))

只是为了检查模式现在的样子：

Just to check how schema looks now:

scala> dataFixed.printSchema
root
 |-- age: integer (nullable = true)
 |-- hours_per_week: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- salaryRange: string (nullable = true)

然后让设置交叉验证和管道：

Then lets set the cross validator and pipeline:

val rf = new RandomForestClassifier()
val pipeline = new Pipeline().setStages(Array(rf)) 
val cv = new CrossValidator().setNumFolds(10).setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)

运行这条线时的错误显示出来：

Error shows up when running this line:

val cmModel = cv.fit(dataFixed)

java.lang.IllegalArgumentException异常：字段特色不存在

这是可以设置标签栏和功能列RandomForestClassifier，但是我有4列，predictors（功能）不是唯一的一个。

It is possible to set label column and feature column in RandomForestClassifier ,however I have 4 columns as predictors (features) not only one.

我应该如何组织我的数据帧因此具有标签和功能正常组织列？

为了您的方便这里到处是code：

For your convenience here is full code :

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.DataFrame

import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.{Vector, Vectors}


object SampleClassification {

  def main(args: Array[String]): Unit = {

    //set spark context
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local");
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    import sqlContext.implicits._
    import com.databricks.spark.csv._

    //load data by using databricks "Spark CSV Library" 
    val data = sqlContext.csvFile("/home/dusan/sample.csv")

    //by default all columns are imported as string so we need to change "age" and  "hours_per_week" to Int
    val toInt    = udf[Int, String]( _.toInt)
    val dataFixed = data.withColumn("age", toInt(data("age"))).withColumn("hours_per_week",toInt(data("hours_per_week")))


    val rf = new RandomForestClassifier()

    val pipeline = new Pipeline().setStages(Array(rf))

    val cv = new CrossValidator().setNumFolds(10).setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)

    // this fails with error
    //java.lang.IllegalArgumentException: Field "features" does not exist.
    val cmModel = cv.fit(dataFixed) 
  }

}

感谢您的帮助！

推荐答案

您只需要确保你在你的数据帧一特色列，它是的键入 VectorUDF ，如下所示：

You simply need to make sure that you have a "features" column in your dataframe that is of type VectorUDF as show below:

scala> val df2 = dataFixed.withColumnRenamed("age", "features")
df2: org.apache.spark.sql.DataFrame = [features: int, hours_per_week: int, education: string, sex: string, salaryRange: string]

scala> val cmModel = cv.fit(df2) 
java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually IntegerType.
    at scala.Predef$.require(Predef.scala:233)
    at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:37)
    at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:50)
    at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
    at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:118)
    at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:164)
    at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:164)
    at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
    at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
    at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
    at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:164)
    at org.apache.spark.ml.tuning.CrossValidator.transformSchema(CrossValidator.scala:142)
    at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:59)
    at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:107)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:67)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:72)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:74)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:76)

EDIT1

本质上说需要在你的数据帧的功能，例如标签的特征向量和标签两个领域。实例的类型必须是双击。

Essentially there need to be two fields in your data frame "features" for feature vector and "label" for instance labels. Instance must be of type Double.

要建立一个功能领域与矢量键入首先创建一个 UDF ，如下所示：

To create a "features" fields with Vector type first create a udf as show below:

val toVec4    = udf[Vector, Int, Int, String, String] { (a,b,c,d) => 
  val e3 = c match {
    case "hs-grad" => 0
    case "bachelors" => 1
    case "masters" => 2
  }
  val e4 = d match {case "male" => 0 case "female" => 1}
  Vectors.dense(a, b, e3, e4) 
}

现在也EN code中的标签字段，创建另一个 UDF ，如下所示：

Now to also encode the "label" field, create another udf as shown below:

val encodeLabel    = udf[Double, String]( _ match { case "A" => 0.0 case "B" => 1.0} )

现在我们使用变换原数据帧这两个 UDF ：

Now we transform original dataframe using these two udf:

val df = dataFixed.withColumn(
  "features",
  toVec4(
    dataFixed("age"),
    dataFixed("hours_per_week"),
    dataFixed("education"),
    dataFixed("sex")
  )
).withColumn("label", encodeLabel(dataFixed("salaryRange"))).select("features", "label")

请注意，可以有额外的列/域present的数据帧，但在这种情况下，我只选择了功能和标签：

Note that there can be extra columns / fields present in the dataframe, but in this case I have selected only features and label:

scala> df.show()
+-------------------+-----+
|           features|label|
+-------------------+-----+
|[38.0,40.0,0.0,0.0]|  0.0|
|[28.0,40.0,1.0,1.0]|  0.0|
|[52.0,45.0,0.0,0.0]|  1.0|
|[31.0,50.0,2.0,1.0]|  1.0|
|[42.0,40.0,1.0,0.0]|  1.0|
+-------------------+-----+

现在其高达您设置正确的参数为你的学习算法，使其工作。

Now its upto you to set correct parameters for your learning algorithm to make it work.

这篇关于如何创建于火花ML分类正确的数据帧的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

如何创建于火花ML分类正确的数据帧 [英] How to create correct data frame for classification in Spark ML

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

如何创建于火花ML分类正确的数据帧 [英] How to create correct data frame for classification in Spark ML

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭