Convert a Spark Vector of features into an array


Problem description

I have a features column which is packaged into a Vector of vectors using Spark's VectorAssembler, as follows. data is the input DataFrame (of type spark.sql.DataFrame).

val featureCols = Array("feature_1","feature_2","feature_3")
val featureAssembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val dataWithFeatures = featureAssembler.transform(data)

I am developing a custom Classifier using the Classifier and ClassificationModel developer API. ClassificationModel requires development of a predictRaw() function which outputs a vector of predicted labels from the model.

def predictRaw(features: FeaturesType) : Vector

This function is set by the API and takes a parameter, features of FeaturesType and outputs a Vector (which in my case I'm taking to be a Spark DenseVector as DenseVector extends the Vector trait).

Due to the packaging by VectorAssembler, the features column is of type Vector and each element is itself a vector, of the original features for each training sample. For example:

features column - of type Vector
[1.0, 2.0, 3.0] - element1, itself a vector
[3.5, 4.5, 5.5] - element2, itself a vector

I need to extract these features into an Array[Double] in order to implement my predictRaw() logic. Ideally I would like the following result in order to preserve the cardinality:

val result: Array[Double] = Array(1.0, 3.5, 2.0, 4.5, 3.0, 5.5)

i.e. in column-major order as I will turn this into a matrix.
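As a quick pure-Scala illustration (outside Spark), transposing the per-row feature arrays and then flattening produces exactly that column-major order:

```scala
// Per-row feature arrays, as Vector.toArray would give for each training sample
val rows: Array[Array[Double]] = Array(
  Array(1.0, 2.0, 3.0),
  Array(3.5, 4.5, 5.5)
)

// transpose groups values by column; flatten then emits them in column-major order
val colMajor: Array[Double] = rows.transpose.flatten
// colMajor: Array(1.0, 3.5, 2.0, 4.5, 3.0, 5.5)
```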

I have tried:

val array = features.toArray // this gives an array of vectors and doesn't work

I've also tried to input the features as a DataFrame object rather than a Vector but the API is expecting a Vector, due to the packaging of the features from VectorAssembler. For example, this function inherently works, but doesn't conform to the API as it's expecting FeaturesType to be Vector as opposed to DataFrame:

def predictRaw(features: DataFrame): DenseVector = {
  // collect each row's feature vector, then flatten into a single Array[Double]
  val featuresArray: Array[Double] =
    features.rdd.map(r => r.getAs[Vector](0).toArray).collect.flatten
  // rest of logic would go here
}

My problem is that features is of type Vector, not DataFrame. The other option might be to package features as a DataFrame but I don't know how to do that without using VectorAssembler.

All suggestions appreciated, thanks! I have looked at Access element of a vector in a Spark DataFrame (Logistic Regression probability vector) but this is in python and I'm using Scala.

Recommended answer

Spark 3.0 added a vector_to_array UDF, so there is no need to implement it yourself: https://github.com/apache/spark/pull/26910

import org.apache.spark.ml.linalg.{SparseVector, Vector}
import org.apache.spark.mllib.linalg.{Vector => OldVector}
import org.apache.spark.sql.functions.udf

private val vectorToArrayUdf = udf { vec: Any =>
  vec match {
    case v: Vector => v.toArray
    case v: OldVector => v.toArray
    case v => throw new IllegalArgumentException(
      "function vector_to_array requires a non-null input argument and input type must be " +
      "`org.apache.spark.ml.linalg.Vector` or `org.apache.spark.mllib.linalg.Vector`, " +
      s"but got ${ if (v == null) "null" else v.getClass.getName }.")
  }
}.asNonNullable()
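For intuition about what the `toArray` calls above do in the sparse case, here is a minimal pure-Scala sketch; `sparseToArray` is a hypothetical helper mirroring how a SparseVector's `(size, indices, values)` triple expands into a dense array:

```scala
// Hypothetical helper mirroring SparseVector.toArray:
// scatter the stored values into a zero-filled array of the full length
def sparseToArray(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  val dense = Array.fill(size)(0.0)
  indices.indices.foreach(i => dense(indices(i)) = values(i))
  dense
}

val dense = sparseToArray(5, Array(0, 3), Array(1.5, 2.5))
// dense: Array(1.5, 0.0, 0.0, 2.5, 0.0)
```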
