VectorAssembler output only to DenseVector?


Question

There is something very annoying about the behavior of VectorAssembler. I am currently transforming a set of columns into a single vector column and then using the StandardScaler function to apply scaling to the included features. However, it seems that Spark, for memory reasons, decides whether it should use a DenseVector or a SparseVector to represent each row of features. But when you need to use StandardScaler, the input of SparseVector(s) is invalid; only DenseVectors are allowed. Does anybody know a solution to that?

Edit: I decided to just use a UDF instead, which turns the sparse vector into a dense vector. Kind of silly, but it works.

Answer

You're right that VectorAssembler chooses dense vs. sparse output based on whichever format uses less memory.

You don't need a UDF to convert from SparseVector to DenseVector; just use the toArray() method:

from pyspark.ml.linalg import SparseVector, DenseVector

# SparseVector(size, indices, values): a 4-element vector with
# 3.0 at index 1 and 4.0 at index 3
a = SparseVector(4, [1, 3], [3.0, 4.0])
# toArray() returns a numpy array, which DenseVector accepts directly
b = DenseVector(a.toArray())

Also, StandardScaler accepts SparseVector input unless you set withMean=True when you create it. If you do need to de-mean, you have to subtract a (presumably non-zero) number from every component, so the resulting vector won't be sparse any more.

