Make VectorAssembler always choose DenseVector


Question

This is the structure of my DataFrame, as returned by df.columns:

['LastName',
 'FirstName',
 'Stud. ID',
 '10 Relations',
 'Related to Politics',
 '3NF',
 'Documentation & Scripts',
 'SQL',
 'Data (CSV, etc.)',
 '20 Relations',
 'Google News',
 'Cheated',
 'Sum',
 'Delay Factor',
 'Grade (out of 2)']

I built an assembler over three of these columns:

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=['10 Relations', 'Related to Politics', '3NF'],
    outputCol='features')

and then ran output = assembler.transform(df). The result contains Row objects with the following schema (this is what output.printSchema() prints):

root
 |-- LastName: string (nullable = true)
 |-- FirstName: string (nullable = true)
 |-- Stud. ID: integer (nullable = true)
 |-- 10 Relations: integer (nullable = true)
 |-- Related to Politics: integer (nullable = true)
 |-- 3NF: integer (nullable = true)
 |-- Documentation & Scripts: integer (nullable = true)
 |-- SQL: integer (nullable = true)
 |-- Data (CSV, etc.): integer (nullable = true)
 |-- 20 Relations: integer (nullable = true)
 |-- Google News: integer (nullable = true)
 |-- Cheated: integer (nullable = true)
 |-- Sum: integer (nullable = true)
 |-- Delay Factor: double (nullable = true)
 |-- Grade (out of 2): double (nullable = true)
 |-- features: vector (nullable = true)

For each row, the assembler stores the features vector as either sparse or dense, whichever takes less memory. This is a big problem for me, because I want to use the transformed data to fit a linear regression model. So I am looking for a way to make VectorAssembler always produce a DenseVector.

Any ideas?

Note: I have read this post. The problem is that Row is a subclass of tuple, so I cannot modify a Row object after it has been created.

Recommended answer

SparseVector and DenseVector both inherit from pyspark.ml.linalg.Vector, so both have a .toArray() method. You can convert either one to a NumPy array and then to a DenseVector with a simple UDF.

from pyspark.ml.linalg import SparseVector, Vectors, VectorUDT
from pyspark.sql import functions as F

v = Vectors.dense([1, 3])  # dense vector
u = SparseVector(2, {})    # sparse vector of size 2 with no non-zero entries

# toDense converts either vector type into a DenseVector
toDense = lambda v: Vectors.dense(v.toArray())
toDense(u), toDense(v)

Result:

DenseVector([0.0, 0.0]), DenseVector([1.0, 3.0])
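
To make it clearer why toArray() gives the same kind of output for both vector types, here is a plain-Python sketch of what expanding a sparse vector means conceptually. The helper name sparse_to_dense is hypothetical and this is not PySpark's implementation; it only mirrors the SparseVector(size, {index: value}) form used above.

```python
def sparse_to_dense(size, entries):
    """Expand a {index: value} mapping into a full list of floats.

    Hypothetical helper for illustration only: indices that are not
    listed in `entries` are filled with 0.0.
    """
    dense = [0.0] * size
    for index, value in entries.items():
        if not 0 <= index < size:
            raise IndexError(f"index {index} is out of range for size {size}")
        dense[index] = float(value)
    return dense


# Same shape as SparseVector(2, {}) above: no stored entries, so all zeros.
print(sparse_to_dense(2, {}))
# A vector of size 4 with non-zero values at positions 1 and 3.
print(sparse_to_dense(4, {1: 3, 3: 5}))
```

A dense vector already stores every position, so converting it is just a copy; that is why a single function can handle both cases.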

You can then wrap this function in a UDF.

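# One row holding the dense vector and one holding the sparse vector
df = sqlContext.createDataFrame([(v,), (u,)], ['feature'])

# Wrap the conversion in a UDF whose return type is a vector column
toDense = lambda v: Vectors.dense(v.toArray())
toDenseUdf = F.udf(toDense, VectorUDT())
df.withColumn('feature', toDenseUdf('feature')).show()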

Result:

+---------+
|  feature|
+---------+
|[1.0,3.0]|
|[0.0,0.0]|
+---------+

The column now holds a single vector type (DenseVector).
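
As background on why the assembler output in the question mixes the two types, the following is a plain-Python sketch of the size rule that VectorAssembler appears to apply per row. It is modeled on my reading of Vector.compressed in Spark's Scala source and should be treated as an assumption, since the exact rule may differ between Spark versions. The function name assembler_would_pick_sparse is hypothetical.

```python
def assembler_would_pick_sparse(values):
    """Return True if the assumed rule would store `values` as sparse.

    Assumption (from Spark's Scala source, not from this thread): a dense
    vector costs about 8 * size + 8 bytes and a sparse one about
    12 * nnz + 20 bytes, so sparse is chosen when 1.5 * (nnz + 1) < size.
    """
    size = len(values)
    nnz = sum(1 for x in values if x != 0)  # number of non-zero entries
    return 1.5 * (nnz + 1.0) < size


# Rows with three features, like the three input columns in the question.
for row in ([0, 0, 0], [1, 0, 0], [1, 1, 0], [2, 1, 1]):
    kind = "sparse" if assembler_would_pick_sparse(row) else "dense"
    print(row, "->", kind)
```

If this rule holds, then with only three input columns a row is stored as sparse only when all three values are zero, which would explain why most rows are dense and a few are not.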
