What type should the dense vector be when using a UDF function in PySpark?

Problem Description

I want to change a List to a Vector in PySpark, and then use this column to train a machine learning model. But my Spark version is 1.6.0, which does not have VectorUDT(). So what type should I return in my udf function?

from pyspark.sql import SQLContext
from pyspark import SparkContext, SparkConf
from pyspark.sql.functions import *
from pyspark.mllib.linalg import DenseVector
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import *


conf = SparkConf().setAppName('rank_test')
sc = SparkContext(conf=conf)
spark = SQLContext(sc)


df = spark.createDataFrame([[[0.1,0.2,0.3,0.4,0.5]]],['a'])
print '???'
df.show()
def list2vec(column):
    print '?????',column
    return Vectors.dense(column)
getVector = udf(lambda y: list2vec(y),DenseVector() )
df.withColumn('b',getVector(col('a'))).show()

I have tried many types, and this DenseVector() gives me an error:

Traceback (most recent call last):
  File "t.py", line 21, in <module>
    getVector = udf(lambda y: list2vec(y),DenseVector() )
TypeError: __init__() takes exactly 2 arguments (1 given)

Please help me.

Recommended Answer

You can use Vectors and VectorUDT with a UDF. The second argument to udf must be a SQL DataType; DenseVector is the vector class itself (and its constructor requires values, hence the TypeError above), while VectorUDT() is the DataType that represents vector columns.

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F

ud_f = F.udf(lambda r: Vectors.dense(r), VectorUDT())
df = df.withColumn('b', ud_f('a'))
df.show()
+-------------------------+---------------------+
|a                        |b                    |
+-------------------------+---------------------+
|[0.1, 0.2, 0.3, 0.4, 0.5]|[0.1,0.2,0.3,0.4,0.5]|
+-------------------------+---------------------+

df.printSchema()
root
  |-- a: array (nullable = true)
  |    |-- element: double (containsNull = true)
  |-- b: vector (nullable = true)
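
Since the question targets Spark 1.6.0, where the pyspark.ml.linalg module does not exist yet, the same pattern should work with the MLlib classes instead. This is a minimal sketch under the assumption that VectorUDT can be imported from pyspark.mllib.linalg on 1.6 (the class is defined in that module):

# Sketch for Spark 1.6: assumes VectorUDT is importable from pyspark.mllib.linalg
from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F

# Declare VectorUDT() as the UDF's return type, then map the array column to a dense vector
list_to_vec = F.udf(lambda xs: Vectors.dense(xs), VectorUDT())
df = df.withColumn('b', list_to_vec(F.col('a')))
df.show()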

For more about VectorUDT, see http://spark.apache.org/docs/2.2.0/api/python/_modules/pyspark/ml/linalg.html
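
The original goal was to train a machine learning model on the new vector column. As a hedged illustration only (the 'label' column and the LogisticRegression estimator below are assumptions, not part of the question or the answer), an ml estimator can consume the vector column by pointing featuresCol at it:

# Hypothetical training example; the 'label' column and LogisticRegression are illustrative assumptions
from pyspark.ml.classification import LogisticRegression

train_df = spark.createDataFrame(
    [([0.1, 0.2, 0.3, 0.4, 0.5], 1.0),
     ([0.5, 0.4, 0.3, 0.2, 0.1], 0.0)],
    ['a', 'label'])
train_df = train_df.withColumn('b', ud_f('a'))  # reuse the UDF from the answer

lr = LogisticRegression(featuresCol='b', labelCol='label')
model = lr.fit(train_df)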
