Convert a Spark Vector of features into an array
Problem Description
I have a features column which is packed into a Vector of vectors using Spark's VectorAssembler, as follows. data is the input DataFrame (of type spark.sql.DataFrame).
val featureCols = Array("feature_1","feature_2","feature_3")
val featureAssembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val dataWithFeatures = featureAssembler.transform(data)
I am developing a custom classifier using the Classifier and ClassificationModel developer API. ClassificationModel requires implementation of a predictRaw() function, which outputs a vector of predicted labels from the model.
def predictRaw(features: FeaturesType) : Vector
This function is set by the API and takes a parameter features of type FeaturesType, and outputs a Vector (which in my case I'm taking to be a Spark DenseVector, as DenseVector extends the Vector trait).
Due to the packaging by VectorAssembler, the features column is of type Vector, and each element is itself a vector of the original features for each training sample. For example:
features column - of type Vector
[1.0, 2.0, 3.0] - element1, itself a vector
[3.5, 4.5, 5.5] - element2, itself a vector
I need to extract these features into an Array[Double] in order to implement my predictRaw() logic. Ideally I would like the following result, in order to preserve the cardinality:
`val result: Array[Double] = Array(1.0, 3.5, 2.0, 4.5, 3.0, 5.5)`
i.e. in column-major order, as I will turn this into a matrix.
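For reference, the column-major flattening itself can be sketched in plain Scala, assuming the per-row feature vectors have already been extracted into arrays (the values below are the hypothetical ones from the example above):

```scala
object ColumnMajorSketch {
  def main(args: Array[String]): Unit = {
    // Each inner array is one row's features, as in the example table above
    val rows = Seq(Array(1.0, 2.0, 3.0), Array(3.5, 4.5, 5.5))
    val numFeatures = rows.head.length
    // Column-major: for each feature index j, take feature j from every row
    val result: Array[Double] = (0 until numFeatures).flatMap(j => rows.map(_(j))).toArray
    println(result.mkString(", "))
  }
}
```

This produces the features grouped column by column, ready to be reshaped into a matrix.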
I have tried:
val array = features.toArray // this gives an array of vectors and doesn't work
I've also tried to input the features as a DataFrame object rather than a Vector, but the API expects a Vector due to the packaging of the features by VectorAssembler. For example, the following function inherently works, but doesn't conform to the API, as it expects FeaturesType to be Vector as opposed to DataFrame:
def predictRaw(features: DataFrame): DenseVector = {
  // flatten is needed here: without it the map/collect yields Array[Array[Double]]
  val featuresArray: Array[Double] = features.rdd.map(r => r.getAs[Vector](0).toArray).collect.flatten
  // rest of logic would go here
}
My problem is that features is of type Vector, not DataFrame. The other option might be to package features as a DataFrame, but I don't know how to do that without using VectorAssembler.
All suggestions appreciated, thanks! I have looked at Access element of a vector in a Spark DataFrame (Logistic Regression probability vector), but that is in Python and I'm using Scala.
Recommended Answer
Spark 3.0 added a vector_to_array UDF, so there is no need to implement this yourself: https://github.com/apache/spark/pull/26910
import org.apache.spark.ml.linalg.{SparseVector, Vector}
import org.apache.spark.mllib.linalg.{Vector => OldVector}
import org.apache.spark.sql.functions.udf

private val vectorToArrayUdf = udf { vec: Any =>
  vec match {
    case v: Vector => v.toArray
    case v: OldVector => v.toArray
    case v => throw new IllegalArgumentException(
      "function vector_to_array requires a non-null input argument and input type must be " +
      "`org.apache.spark.ml.linalg.Vector` or `org.apache.spark.mllib.linalg.Vector`, " +
      s"but got ${ if (v == null) "null" else v.getClass.getName }.")
  }
}.asNonNullable()
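As a usage sketch (assuming Spark 3.0+, an active SparkSession, and the dataWithFeatures DataFrame produced by the VectorAssembler above; the column name "features_arr" is just an illustrative choice), the built-in helper can be applied directly rather than defining the UDF by hand:

```scala
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.functions.col

// Convert the assembled "features" Vector column into an array<double> column
val withArrays = dataWithFeatures.withColumn("features_arr", vector_to_array(col("features")))

// Collect to the driver as plain arrays, then flatten column-major
// (matching the desired result in the question)
val rows: Array[Array[Double]] = withArrays
  .select("features_arr")
  .collect()
  .map(_.getSeq[Double](0).toArray)
val numFeatures = rows.headOption.map(_.length).getOrElse(0)
val result: Array[Double] = (0 until numFeatures).flatMap(j => rows.map(_(j))).toArray
```

Note that collect() brings all rows to the driver, so this flattening step is only appropriate for data that fits in driver memory.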