Column features must be of type org.apache.spark.ml.linalg.VectorUDT


Problem description

I want to run this code in pyspark (spark 2.1.1):

from pyspark.ml.feature import PCA

bankPCA = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
pcaModel = bankPCA.fit(bankDf)
pcaResult = pcaModel.transform(bankDf).select("label", "pcaFeatures")
pcaResult.show(truncate=False)

But I get this error:

requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.

Recommended answer

An example can be found here:

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
    (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
    (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])

pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)

... other code ...

As you can see above, df is a dataframe which contains Vectors.sparse() and Vectors.dense() values imported from pyspark.ml.linalg.

Probably, your bankDf contains Vectors imported from pyspark.mllib.linalg.

So you have to make sure that the Vectors in your dataframe are created using

from pyspark.ml.linalg import Vectors 

代替:

from pyspark.mllib.linalg import Vectors

You may also find this stackoverflow question interesting.
