How to merge multiple feature vectors in DataFrame?

Question

Using Spark ML transformers, I arrived at a DataFrame where each row looks like this:

Row(object_id, text_features_vector, color_features, type_features)

where text_features is a sparse vector of term weights, color_features is a small 20-element one-hot-encoded dense vector of colors, and type_features is likewise a one-hot-encoded dense vector of types.

What would be a good approach (using Spark's facilities) to merge these features into a single large vector, so that I can measure things like the cosine distance between any two objects?

Recommended answer

You can use VectorAssembler, which concatenates the given columns into a single vector column:

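import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

// Placeholder: the DataFrame produced by the upstream transformers.
val df: DataFrame = ???

// Concatenate the input columns, in the order listed, into one vector column.
// The names must match the actual column names of df.
val assembler = new VectorAssembler()
  .setInputCols(Array("text_features", "color_features", "type_features"))
  .setOutputCol("features")

val transformed = assembler.transform(df)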

For a PySpark example, see: Encode and assemble multiple features in PySpark
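
Since the question mentions cosine distance, here is a minimal sketch of how it could be computed for two assembled vectors once they have been converted to plain arrays. The object CosineSketch and the function cosineDistance are hypothetical names used for illustration; they are not part of Spark, and this is only one possible way to do it.

```scala
// Hypothetical helper (not a Spark API): cosine distance between two
// equal-length arrays of doubles.
object CosineSketch {
  def cosineDistance(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    // Dot product of the two vectors.
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    // Euclidean (L2) norms.
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    require(normA > 0 && normB > 0, "cosine distance is undefined for zero vectors")
    // Cosine distance = 1 - cosine similarity.
    1.0 - dot / (normA * normB)
  }
}
```

Assuming v1 and v2 are org.apache.spark.ml.linalg.Vector values read from the "features" column, the call would be CosineSketch.cosineDistance(v1.toArray, v2.toArray). Note that toArray produces a dense copy, which may be costly for a very large sparse text vector.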
