How to merge multiple feature vectors in a DataFrame?
Problem description
Using Spark ML transformers, I arrived at a DataFrame where each row looks like this:
Row(object_id, text_features_vector, color_features, type_features)
where text_features is a sparse vector of term weights, color_features is a small 20-element one-hot-encoded dense vector of colors, and type_features is also a one-hot-encoded dense vector of types.
What would be a good approach (using Spark's facilities) to merge these features into one single, large vector, so that I can measure things like the cosine distance between any two objects?
Recommended answer
You can use VectorAssembler:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

val df: DataFrame = ???

// Concatenate the three feature columns into a single vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("text_features", "color_features", "type_features"))
  .setOutputCol("features")

val transformed = assembler.transform(df)
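Once the features are assembled, the cosine distance mentioned in the question can be computed on pairs of vectors from the "features" column. The following is a minimal sketch, not part of the original answer: the helper name cosineDistance is hypothetical, and returning 1.0 when either vector has zero norm is an assumed convention.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Hypothetical helper: cosine distance = 1 - (a . b) / (||a|| * ||b||)
def cosineDistance(a: Vector, b: Vector): Double = {
  require(a.size == b.size, "vectors must have the same length")
  // toArray densifies both vectors; acceptable for a sketch,
  // but potentially costly for very wide sparse vectors
  val dot = a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
  val norms = Vectors.norm(a, 2.0) * Vectors.norm(b, 2.0)
  // Assumed convention: distance is 1.0 if either vector is all zeros
  if (norms == 0.0) 1.0 else 1.0 - dot / norms
}
```

It could be applied to two rows collected from transformed, reading each vector with row.getAs[Vector]("features"). For all-pairs comparisons on a large dataset, collecting rows to the driver will not scale, so this only illustrates the per-pair computation.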
For a PySpark example, see: Encode and assemble multiple features in PySpark