Column features must be of type org.apache.spark.ml.linalg.VectorUDT
Question

I want to run this code in pyspark (Spark 2.1.1):
from pyspark.ml.feature import PCA
bankPCA = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
pcaModel = bankPCA.fit(bankDF)
pcaResult = pcaModel.transform(bankDF).select("label", "pcaFeatures")
pcaResult.show(truncate=False)
But I get this error:

requirement failed: Column features must be of type
org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7
but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.
Answer

It can be found here:
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
(Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
... other code ...
As you can see above, df is a dataframe which contains Vectors.sparse() and Vectors.dense() that are imported from pyspark.ml.linalg.
Probably, your bankDF contains Vectors imported from pyspark.mllib.linalg.
So you have to make sure that the Vectors in your dataframe are imported with
from pyspark.ml.linalg import Vectors
instead of:
from pyspark.mllib.linalg import Vectors
You may also find this stackoverflow question interesting.