Prepare data for MultilayerPerceptronClassifier in Scala

Problem description

Please keep in mind that I'm new to Scala.

This is the example I am trying to follow: https://spark.apache.org/docs/1.5.1/ml-ann.html

It uses this dataset: https://github.com/apache/spark/blob/master/data/mllib/sample_multiclass_classification_data.txt

I have prepared my .csv using the code below to get a DataFrame for classification in Scala.

//imports for ML
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row

//imports for transformation
import sqlContext.implicits._
import com.databricks.spark.csv._
import org.apache.spark.mllib.linalg.{Vector, Vectors}

//load data
val data2 = sqlContext.csvFile("/Users/administrator/Downloads/ds_15k_10-2.csv")

//Rename any one column to features
//val df2 = data.withColumnRenamed("ip_crowding", "features")
val DF2 = data2.select("gst_id_matched","ip_crowding","lat_long_dist");

scala> DF2.take(2)
res6: Array[org.apache.spark.sql.Row] = Array([0,0,0], [0,0,1628859.542])

//define a String-to-Double conversion UDF
val toDouble = udf[Double, String]( _.toDouble)

//Convert all to double
val featureDf = DF2
.withColumn("gst_id_matched",toDouble(DF2("gst_id_matched")))
.withColumn("ip_crowding",toDouble(DF2("ip_crowding")))
.withColumn("lat_long_dist",toDouble(DF2("lat_long_dist")))
.select("gst_id_matched","ip_crowding","lat_long_dist")


//Define the format
val toVec4 = udf[Vector, Double,Double] { (v1,v2) => Vectors.dense(v1,v2) }

//Encode the label, which is gst_id_matched
val encodeLabel = udf[Double, String](_ match {
  case "0.0" => 0.0
  case "1.0" => 1.0
})

//Transformed dataset
val df = featureDf
.withColumn("features",toVec4(featureDf("ip_crowding"),featureDf("lat_long_dist")))
.withColumn("label",encodeLabel(featureDf("gst_id_matched")))
.select("label", "features")

val splits = df.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network: 
// input layer of size 4 (features), two intermediate of size 5 and 4 and output of size 3 (classes)
val layers = Array[Int](0, 0, 0, 0)
// create the trainer and set its parameter


val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(12)
  .setSeed(1234L)
  .setMaxIter(10)
// train the model
val model = trainer.fit(train)

The last line generates this error:

15/11/21 22:46:23 ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 15)
java.lang.ArrayIndexOutOfBoundsException: 0

My suspicions:

When I examine the dataset, it looks fine for classification:

scala> df.take(2)
res3: Array[org.apache.spark.sql.Row] = Array([0.0,[0.0,0.0]], [0.0,[0.0,1628859.542]])

But the Apache example dataset is different, and my transformation does not give me what I need. Can someone please help me transform the dataset, or help me understand the root cause of the problem?

This is what the Apache dataset looks like:

scala> data.take(1)
res8: Array[org.apache.spark.sql.Row] = Array([1.0,(4,[0,1,2,3],[-0.222222,0.5,-0.762712,-0.833333])])

Solution

The source of your problem is an incorrect definition of layers. When you use

val layers = Array[Int](0, 0, 0, 0)

it means you want a network with zero nodes in each layer, which simply doesn't make sense. Generally speaking, the number of neurons in the input layer should equal the number of features, the number in the output layer should equal the number of classes, and each hidden layer should contain at least one neuron.
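For the data in this question, a sensible shape would look like the sketch below (the hidden-layer sizes 5 and 4 are arbitrary illustrative choices):

// input layer = number of features, output layer = number of classes
val numFeatures = 2 // ip_crowding and lat_long_dist
val numClasses = 2  // labels 0.0 and 1.0 produced by encodeLabel
val layers = Array[Int](numFeatures, 5, 4, numClasses)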

Let's recreate your data, simplifying your code along the way:

import org.apache.spark.sql.functions.col

val df = sc.parallelize(Seq(
  ("0", "0", "0"), ("0", "0", "1628859.542")
)).toDF("gst_id_matched", "ip_crowding", "lat_long_dist")

Convert all columns to doubles:

val numeric = df
  .select(df.columns.map(c => col(c).cast("double").alias(c)): _*)
  .withColumnRenamed("gst_id_matched", "label")
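The select with df.columns.map(...) and : _* casts every column in a single pass by expanding the mapped columns as varargs. An equivalent, more explicit version (a sketch with the same behavior) would be:

val numericVerbose = df
  .withColumn("gst_id_matched", col("gst_id_matched").cast("double"))
  .withColumn("ip_crowding", col("ip_crowding").cast("double"))
  .withColumn("lat_long_dist", col("lat_long_dist").cast("double"))
  .withColumnRenamed("gst_id_matched", "label")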

Assemble features:

import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("ip_crowding","lat_long_dist"))
  .setOutputCol("features")

val data = assembler.transform(numeric)
data.show

// +-----+-----------+-------------+-----------------+
// |label|ip_crowding|lat_long_dist|         features|
// +-----+-----------+-------------+-----------------+
// |  0.0|        0.0|          0.0|        (2,[],[])|
// |  0.0|        0.0|  1628859.542|[0.0,1628859.542]|
// +-----+-----------+-------------+-----------------+
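As an aside, this also explains the formatting difference you noticed: the all-zero row prints as (2,[],[]) because MLlib stores it as a sparse vector, and the Apache sample data (loaded from LibSVM format) prints the same way. Sparse and dense vectors are interchangeable as classifier input; a quick sanity check (a sketch built from the row in your question):

import org.apache.spark.mllib.linalg.Vectors

// The Apache sample row reconstructed: a size-4 sparse vector with every index set
val sparseRow = Vectors.sparse(4, Array(0, 1, 2, 3),
  Array(-0.222222, 0.5, -0.762712, -0.833333))

// The same values as a dense vector, like the ones toVec4 produces
val denseRow = Vectors.dense(-0.222222, 0.5, -0.762712, -0.833333)

sparseRow.toArray.sameElements(denseRow.toArray) // true: identical contents

So the vector representation was never the problem; the layer definition was.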

Train and test network:

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val layers = Array[Int](2, 3, 5, 3) // Note 2 neurons in the input layer
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val model = trainer.fit(data)
model.transform(data).show

// +-----+-----------+-------------+-----------------+----------+
// |label|ip_crowding|lat_long_dist|         features|prediction|
// +-----+-----------+-------------+-----------------+----------+
// |  0.0|        0.0|          0.0|        (2,[],[])|       0.0|
// |  0.0|        0.0|  1628859.542|[0.0,1628859.542]|       0.0|
// +-----+-----------+-------------+-----------------+----------+
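To evaluate on held-out data rather than re-scoring the training set, you can split the DataFrame and score it with MulticlassClassificationEvaluator, as in the ml-ann example linked in the question. A sketch (with only two rows this toy split is degenerate, so treat it purely as illustration):

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val Array(trainData, testData) = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val heldOutModel = trainer.fit(trainData)
val result = heldOutModel.transform(testData)

val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("precision") // the metric used in the Spark 1.5.1 ml-ann example
println("Precision: " + evaluator.evaluate(result.select("prediction", "label")))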
