Spark CountVectorizer returns udt instead of vector
Question
I am trying to create a vector of token counts for an LDA analysis in Spark 2.3.0. I have followed several tutorials, and each of them uses CountVectorizer to easily convert an Array of String to a Vector.
I ran this short example in my Databricks notebook:
import org.apache.spark.ml.feature.CountVectorizer

val testW = Seq(
  (8, Array("Zara", "Nuha", "Ayan", "markle")),
  (9, Array("fdas", "test", "Ayan", "markle")),
  (10, Array("qwertzu", "test", "Ayan", "fdaf"))
).toDF("id", "filtered")

// Set params for CountVectorizer
val vectorizer = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVocabSize(5)
  .setMinDF(2)
  .fit(testW)

// Create vector of token counts
val articlesCountVector = vectorizer.transform(testW).select("id", "features")
display(articlesCountVector)
The output is as follows: [image: output]
But in all the tutorials I have read, the type of "features" is vector. Why is it udt in my case?
Did I forget something? Why is it not a vector?
Is it possible to convert it? I cannot create an LDA model with this udt type.
Recommended answer
There is no issue here. What you see is an implementation detail of the Databricks display function.
Internally, neither o.a.s.ml.linalg.Vector nor o.a.s.mllib.linalg.Vector is natively represented in the Dataset API; both are implemented as UDTs (UserDefinedTypes). Hence the output.
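One way to confirm this is to inspect the schema directly rather than relying on display. The following is a minimal sketch, not taken from the original answer: it assumes a notebook or spark-shell style session (the builder call, the app name "udt-check", the local master, and the LDA settings k = 2 and maxIter = 5 are illustrative choices). It compares the column's type with SQLDataTypes.VectorType, reads one value back as an ml.linalg.Vector, and fits an LDA model on the column without any conversion.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType

// Assumption: in a Databricks notebook or spark-shell, getOrCreate() returns
// the session that already exists; the master setting only matters locally.
val spark = SparkSession.builder()
  .appName("udt-check")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Same data as in the question.
val testW = Seq(
  (8, Array("Zara", "Nuha", "Ayan", "markle")),
  (9, Array("fdas", "test", "Ayan", "markle")),
  (10, Array("qwertzu", "test", "Ayan", "fdaf"))
).toDF("id", "filtered")

val cvModel = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVocabSize(5)
  .setMinDF(2)
  .fit(testW)

val counts = cvModel.transform(testW).select("id", "features")

// printSchema() labels the column with the UDT's type name.
counts.printSchema()

// VectorType is the public handle to the vector UDT used by spark.ml.
val featuresType = counts.schema("features").dataType
println(s"features is a vector column: ${featuresType == VectorType}")

// Each row value can be read back as an ordinary ml.linalg.Vector.
val first: Vector = counts.head().getAs[Vector]("features")
println(first)

// LDA consumes the column as-is; k and maxIter are illustrative values.
val ldaModel = new LDA()
  .setK(2)
  .setMaxIter(5)
  .setFeaturesCol("features")
  .fit(counts)
ldaModel.describeTopics(3).show(false)
```

If the type check holds, the "udt" label in display is only a rendering of the same vector column the tutorials show, so no conversion step should be needed before LDA.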
You can read more about this in the related question on understanding the output of VectorAssembler in Spark.