Formatting data for Spark ML

Question

I'm new to Spark and Spark ML. I generated some data with the function KMeansDataGenerator.generateKMeansRDD, but I fail when formatting it so that it can then be used by an ML algorithm (here K-Means).

The error is:

Exception in thread "main" java.lang.IllegalArgumentException: Data type ArrayType(DoubleType,false) is not supported.

It happens when using VectorAssembler.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.util.KMeansDataGenerator

// Generate 1000 random points in 3 dimensions around 5 centers
// (run in spark-shell, so sc and the toDF implicits are already in scope)
val generatedData = KMeansDataGenerator.generateKMeansRDD(sc, numPoints = 1000, k = 5, d = 3,
  r = 5, numPartitions = 1)

// Single column "value" of type ArrayType(DoubleType,false)
val df = generatedData.toDF()

val assembler = new VectorAssembler()
  .setInputCols(Array("value"))
  .setOutputCol("features")
val df_final = assembler.transform(df).select("features")   // throws the exception above
df_final.show()

val nbClusters = 5
val nbIterations = 200
val kmeans = new KMeans().setK(nbClusters).setSeed(1L).setMaxIter(nbIterations)
val model = kmeans.fit(df)

Answer

VectorAssembler accepts only three types of columns (see the short sketch after this list):

  • DoubleType - double scalar, optionally with column metadata.
  • NumericType - arbitrary numeric.
  • VectorUDT - vector column.
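
For illustration, here is a minimal sketch (hypothetical toy data and column names d, i and v; assumes a Spark 2.x SparkSession named spark) showing that a mix of these supported column types assembles fine, whereas the question's array column does not:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

// One column of each supported kind: DoubleType, a NumericType (Int) and VectorUDT
val ok = spark.createDataFrame(Seq(
  (1.0, 2, Vectors.dense(3.0, 4.0)),
  (5.0, 6, Vectors.dense(7.0, 8.0))
)).toDF("d", "i", "v")

new VectorAssembler()
  .setInputCols(Array("d", "i", "v"))
  .setOutputCol("features")
  .transform(ok)       // works: every input column has a supported type
  .show(false)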

You are trying to pass ArrayType(DoubleType), which is not supported. You should convert your data to a supported type (o.a.s.ml.linalg.DenseVector / VectorUDT seems like a reasonable choice). For example:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}

// Spark 2.0. For 1.x use mllib
// https://spark.apache.org/docs/latest/sql-programming-guide.html#data-types
// Wrap each array of doubles into an ML DenseVector (column type VectorUDT)
val seqAsVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))

val df_final = df.withColumn("features", seqAsVector(col("value")))
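
Putting it together, a sketch of how the converted DataFrame could then be fed to K-Means (the parameter values are the ones from the question; printing the cluster centers is just to show the model is usable) might look like this:

import org.apache.spark.ml.clustering.KMeans

// K-Means reads the VectorUDT column "features" produced by the UDF above
val kmeans = new KMeans()
  .setK(5)
  .setSeed(1L)
  .setMaxIter(200)
  .setFeaturesCol("features")
val model = kmeans.fit(df_final)
model.clusterCenters.foreach(println)

Note that the fit is done on df_final, which actually contains the features column; the original snippet called kmeans.fit(df), and df only has the raw value array column.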
