Spark ML VectorAssembler returns strange output


Problem Description

I am experiencing some very strange behaviour from VectorAssembler and I was wondering if anyone else has seen this.

My scenario is pretty straightforward. I parse data from a CSV file that has some standard Int and Double fields, and I also compute some extra columns. My parsing function returns this:

val joinedCounts = countPerChannel ++ countPerSource // two arrays of Doubles joined together
(label, orderNo, pageNo, Vectors.dense(joinedCounts))
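
For reference, a minimal, self-contained sketch of what such a parsing function could look like; the field positions and the countPerChannel/countPerSource computations are placeholders, not from the original post (assuming Spark 2.x, where spark.ml uses org.apache.spark.ml.linalg):

import org.apache.spark.ml.linalg.{Vector, Vectors}

// Hypothetical reconstruction of parseLine: field positions and the
// per-channel/per-source counts are placeholders, not the asker's code.
def parseLine(line: String): (Double, Double, Double, Vector) = {
  val fields  = line.split(",")
  val label   = fields(0).toDouble
  val orderNo = fields(1).toDouble
  val pageNo  = fields(2).toDouble
  val countPerChannel = Array.fill(8)(0.0) // placeholder per-channel counts
  val countPerSource  = Array.fill(8)(0.0) // placeholder per-source counts
  val joinedCounts = countPerChannel ++ countPerSource
  (label, orderNo, pageNo, Vectors.dense(joinedCounts))
}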

My main function uses the parsing function like this:

val parsedData = rawData.filter(row => row != header).map(parseLine)
val data = sqlContext.createDataFrame(parsedData).toDF("label", "orderNo", "pageNo","joinedCounts")

And then I use the VectorAssembler like this:

val assembler = new VectorAssembler()
                           .setInputCols(Array("orderNo", "pageNo", "joinedCounts"))
                           .setOutputCol("features")

val assemblerData = assembler.transform(data)

So when I print a row of my data before it goes into the VectorAssembler, it looks like this:

[3.2,17.0,15.0,[0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,2.0]]

After the transform function of the VectorAssembler, I print the same row of data and get this:

[3.2,(18,[0,1,6,9,14,17],[17.0,15.0,3.0,1.0,4.0,2.0])]

What on earth is going on? What has the VectorAssembler done? I've double-checked all the calculations and even followed the simple Spark examples, and I cannot see what is wrong with my code. Can you?

Recommended Answer

There is nothing strange about the output. Your vector seems to have lots of zero elements, so Spark used its sparse representation.

To explain further:

Your vector is composed of 18 elements (its dimension): orderNo, pageNo, and the 16 values of joinedCounts.

The indices [0,1,6,9,14,17] are the positions in the vector that hold non-zero elements; their values, in that order, are [17.0,15.0,3.0,1.0,4.0,2.0].
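
To make the correspondence concrete, here is a minimal sketch (assuming Spark 2.x and org.apache.spark.ml.linalg) showing that the sparse and dense forms encode exactly the same vector:

import org.apache.spark.ml.linalg.Vectors

// (18, [0,1,6,9,14,17], [17.0,15.0,3.0,1.0,4.0,2.0]) reads as: size 18,
// non-zero values at the listed indices, implicit zeros everywhere else.
val sparse = Vectors.sparse(18, Array(0, 1, 6, 9, 14, 17),
                                Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0))
val dense = Vectors.dense(17.0, 15.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0,
                          1.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 2.0)
println(sparse == dense) // true: both represent the same 18-dimensional vector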

The sparse vector representation is a way to save space and make computation easier and faster: here Spark stores only 6 index/value pairs instead of all 18 doubles. For more on sparse representations, see the Spark MLlib documentation on local vector types.

Now, of course, you can convert that sparse representation back to a dense representation, but it comes at a cost.
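
If you do want the dense form, one way is a small UDF; a minimal sketch, assuming Spark 2.x and the column names from the question's code:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Convert the assembled vectors to their dense representation.
// Note: this materialises every zero, so memory use grows with dimensionality.
val toDense = udf { v: Vector => v.toDense }
val denseData = assemblerData.withColumn("features", toDense(col("features")))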

In case you are interested in getting feature importances from models trained on these vectors, mapping the assembled indices back to the original input columns is worth a separate look.
