How to convert ArrayType to DenseVector in PySpark DataFrame?
Question
I'm getting the following error when trying to build an ML Pipeline:
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType(DoubleType,true).'
My features column contains an array of floating point values. It sounds like I need to convert those to some type of vector (it's not sparse, so a DenseVector?). Is there a way to do this directly on the DataFrame, or do I need to convert to an RDD?
Recommended answer
You can use a UDF:
udf(lambda vs: Vectors.dense(vs), VectorUDT())
In Spark < 2.0 import:
from pyspark.mllib.linalg import Vectors, VectorUDT
In Spark 2.0+ import:
from pyspark.ml.linalg import Vectors, VectorUDT
Please note that the pyspark.mllib.linalg and pyspark.ml.linalg classes are not compatible with each other, despite having identical implementations.
It is also possible to extract individual features and assemble them with VectorAssembler. Assuming the input column is called features:
from pyspark.ml.feature import VectorAssembler

n = ...  # Size of features

assembler = VectorAssembler(
    inputCols=["features[{0}]".format(i) for i in range(n)],
    outputCol="features_vector")

assembler.transform(df.select(
    "*", *(df["features"].getItem(i) for i in range(n))
))