Spark IllegalArgumentException: Column features must be of type struct&lt;type:tinyint,size:int,indices:array&lt;int&gt;,values:array&lt;double&gt;&gt;


Problem description


I'm trying to use org.apache.spark.ml.regression.LinearRegression to fit my data. So I transformed the original RDD to a dataframe and tried to feed it to the LinearRegression model.

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
val parsedData = dataRDD.map{
  item =>
    val doubleArray = Array(item._1.toDouble, item._2.toDouble, item._3.toDouble)
    val features = Vectors.dense(doubleArray)
    Row(item._4.toDouble, features)
}

val schema = List(
  StructField("label", DoubleType, true),
  StructField("features", new org.apache.spark.mllib.linalg.VectorUDT, true)
)

val df = spark.createDataFrame(
  parsedData,
  StructType(schema)
)
val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lr_model = lr.fit(df)

And here is what the dataframe looks like:

+---------+-------------+
|    label|     features|
+---------+-------------+
|      5.0|[0.0,1.0,0.0]|
|     20.0|[0.0,1.0,0.0]|
|    689.0|[0.0,1.0,0.0]|
|    627.0|[0.0,1.0,0.0]|
|    127.0|[0.0,1.0,0.0]|
|      0.0|[0.0,1.0,0.0]|
|      0.0|[0.0,1.0,0.0]|
|      0.0|[0.0,1.0,0.0]|
|     76.0|[0.0,1.0,0.0]|
|      5.0|[0.0,1.0,0.0]|
|      0.0|[0.0,1.0,0.0]|
|      0.0|[0.0,1.0,0.0]|
|      0.0|[0.0,1.0,0.0]|
|      0.0|[0.0,1.0,0.0]|
|      0.0|[0.0,1.0,0.0]|
|      2.0|[0.0,1.0,0.0]|
|    329.0|[0.0,1.0,0.0]|
|2354115.0|[0.0,1.0,0.0]|
|      5.0|[0.0,1.0,0.0]|
|   4303.0|[0.0,1.0,0.0]|
+---------+-------------+

But it presented the error below.

java.lang.IllegalArgumentException: requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.

The latter data type doesn't seem to differ from the required one. Can anyone help?

Solution

You are using org.apache.spark.ml.regression.LinearRegression (Spark ML) with the old VectorUDT from the deprecated mllib package, and they do not work together. The error message is confusing because both vector types serialize to the same underlying struct, so the two printed schemas look identical; the check actually compares the UDT classes, which differ.

Replace new org.apache.spark.mllib.linalg.VectorUDT with new org.apache.spark.ml.linalg.VectorUDT and it should work.
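Applied to the question's schema, that is a one-line change. A sketch (org.apache.spark.ml.linalg.SQLDataTypes.VectorType is the public handle for the ml-package vector UDT, useful if new VectorUDT is not accessible in your Spark version):

```scala
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val schema = StructType(List(
  StructField("label", DoubleType, true),
  // SQLDataTypes.VectorType is the ml-package vector type that sparkML expects
  StructField("features", SQLDataTypes.VectorType, true)
))
```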

Note that to avoid declaring the schema, you can create the dataframe with toDF (after importing spark's implicits) to let Spark infer the right type (org.apache.spark.ml.linalg.VectorUDT) for you:

import org.apache.spark.ml.linalg.Vectors
import spark.implicits._
val df = dataRDD.map{ item =>
    val doubleArray = Array(item._1.toDouble, item._2.toDouble, item._3.toDouble)
    val features = Vectors.dense(doubleArray)
    (item._4.toDouble, features)
}.toDF("label", "features")
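Relatedly, if an RDD already holds old mllib vectors, they can be converted in place rather than rebuilt; a sketch using asML, which mllib vectors have provided since Spark 2.0:

```scala
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

val oldVec = OldVectors.dense(0.0, 1.0, 0.0)  // mllib vector
val newVec = oldVec.asML                      // org.apache.spark.ml.linalg.Vector
```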
