Spark CountVectorizer returns udt instead of vector

Problem description

I am trying to create a vector of token counts for an LDA analysis in Spark 2.3.0. I have followed several tutorials, and each time they use CountVectorizer to easily convert an Array of String to a Vector.

I ran this short example in my Databricks notebook:

import org.apache.spark.ml.feature.CountVectorizer

val testW = Seq(
  (8, Array("Zara", "Nuha", "Ayan", "markle")),
  (9, Array("fdas", "test", "Ayan", "markle")),
  (10, Array("qwertzu", "test", "Ayan", "fdaf"))
  ).toDF("id", "filtered")

// Set params for CountVectorizer
val vectorizer = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVocabSize(5) 
  .setMinDF(2) 
  .fit(testW)

// Create vector of token counts
val articlesCountVector = vectorizer.transform(testW).select("id", "features")
display(articlesCountVector)

The output is as follows: [image: output]

But in every tutorial I have read, the type of "features" is vector. Why is it udt in my case?

Did I forget something? Why is it not a vector?

Is it possible to convert it? I cannot create an LDA model with this udt type.

Recommended answer

There is no issue here. What you see is an implementation detail of the Databricks display function.

Internally, neither o.a.s.ml.linalg.Vector nor o.a.s.mllib.linalg.Vector is natively represented in the Dataset API; both are implemented as UDTs (UserDefinedTypes). Hence the output.

You can read more in: Understanding the output of VectorAssembler --- Spark
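One way to confirm that the "udt" label is only a display artifact is to inspect the column's type directly and then pass the DataFrame to LDA unchanged. The following is a minimal sketch, not part of the original answer. It assumes a spark-shell or notebook session; the local SparkSession setup, the app name, and the LDA parameters (k = 2, 5 iterations) are illustrative choices only.

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in Databricks `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("udt-check") // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Same data and CountVectorizer settings as in the question
val testW = Seq(
  (8, Array("Zara", "Nuha", "Ayan", "markle")),
  (9, Array("fdas", "test", "Ayan", "markle")),
  (10, Array("qwertzu", "test", "Ayan", "fdaf"))
).toDF("id", "filtered")

val vectorizer = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVocabSize(5)
  .setMinDF(2)
  .fit(testW)

val counts = vectorizer.transform(testW).select("id", "features")

// Print the schema as Spark itself reports it, independent of display()
counts.printSchema()

// Compare the column's data type with the ml vector type (backed by a UDT)
val isVectorColumn = counts.schema("features").dataType == SQLDataTypes.VectorType

// Each row value can be read back as an ordinary ml.linalg.Vector
val firstVector = counts.head().getAs[Vector]("features")

// LDA consumes the column as-is; no conversion step is needed
val ldaModel = new LDA()
  .setK(2)       // illustrative value
  .setMaxIter(5) // illustrative value
  .setFeaturesCol("features")
  .fit(counts)
```

If `isVectorColumn` is true and the LDA fit completes, the column is already the vector type the tutorials refer to, and the "udt" shown by display can be ignored.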
