Spark CountVectorizer returns udt instead of vector
Question
I am trying to create a vector of token counts for an LDA analysis in Spark 2.3.0. I have followed several tutorials, and each of them uses CountVectorizer to convert an Array of String to a Vector.
I run this short example in my Databricks notebook:
import org.apache.spark.ml.feature.CountVectorizer

val testW = Seq(
  (8, Array("Zara", "Nuha", "Ayan", "markle")),
  (9, Array("fdas", "test", "Ayan", "markle")),
  (10, Array("qwertzu", "test", "Ayan", "fdaf"))
).toDF("id", "filtered")

// Set params for CountVectorizer
val vectorizer = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVocabSize(5)
  .setMinDF(2)
  .fit(testW)

// Create vector of token counts
val articlesCountVector = vectorizer.transform(testW).select("id", "features")
display(articlesCountVector)
The output is as follows (screenshot not included):
But in all the tutorials I have read, the type of "features" is vector. Why is it udt in my case?
Did I forget something? Why is it not a vector?
Is it possible to convert it? I cannot create an LDA model with this udt type.
Recommended answer
There is no issue here. What you see is an implementation detail of the Databricks display function.
Internally, neither o.a.s.ml.linalg.Vector nor o.a.s.mllib.linalg.Vector is natively represented in the Dataset API; both use UDTs (UserDefinedTypes). Hence the output.
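One way to confirm this outside of display is to inspect the schema directly. The following is a minimal sketch, assuming a local SparkSession (in a Databricks notebook the session already exists as spark, so the two session lines can be dropped). The comparison against SQLDataTypes.VectorType is an illustrative check, not part of the original question.

```scala
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for a standalone run; a notebook already provides one.
val spark = SparkSession.builder().master("local[*]").appName("udt-check").getOrCreate()
import spark.implicits._

// Same toy data as in the question.
val testW = Seq(
  (8, Array("Zara", "Nuha", "Ayan", "markle")),
  (9, Array("fdas", "test", "Ayan", "markle")),
  (10, Array("qwertzu", "test", "Ayan", "fdaf"))
).toDF("id", "filtered")

val model = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVocabSize(5)
  .setMinDF(2)
  .fit(testW)

val counts = model.transform(testW).select("id", "features")

// printSchema reports the SQL-level type name of the column.
counts.printSchema()

// The column's DataType is the vector UDT exposed as SQLDataTypes.VectorType.
val featuresType = counts.schema("features").dataType
println(featuresType == SQLDataTypes.VectorType)

// Each row still holds an ordinary ml.linalg.Vector.
val first = counts.head().getAs[Vector]("features")
println(first)
```

If the comparison prints true, the column is the usual ML vector column, and the "udt" label only reflects how the notebook renders user-defined types.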
You can find more details in the related post "Understanding the output of VectorAssembler" (Spark).
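Since the column is already a vector column, it should be usable as LDA input without any conversion. Below is a minimal sketch under the same assumptions as the question's data; the values k = 2 and maxIter = 5 are arbitrary choices for illustration, and the local SparkSession is only needed outside a notebook.

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.SparkSession

// Assumption: a local session for a standalone run; a notebook already provides one.
val spark = SparkSession.builder().master("local[*]").appName("lda-on-counts").getOrCreate()
import spark.implicits._

val testW = Seq(
  (8, Array("Zara", "Nuha", "Ayan", "markle")),
  (9, Array("fdas", "test", "Ayan", "markle")),
  (10, Array("qwertzu", "test", "Ayan", "fdaf"))
).toDF("id", "filtered")

val counts = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVocabSize(5)
  .setMinDF(2)
  .fit(testW)
  .transform(testW)
  .select("id", "features")

// Fit LDA directly on the "features" column; k and maxIter are illustrative values.
val ldaModel = new LDA()
  .setK(2)
  .setMaxIter(5)
  .setFeaturesCol("features")
  .fit(counts)

// One row per topic, with the indices and weights of its top terms.
val topics = ldaModel.describeTopics(3)
topics.show(false)
```

The term indices in the result refer to positions in the CountVectorizer vocabulary, so keeping the fitted CountVectorizerModel around makes it possible to map them back to words.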