What type should the dense vector be when using a UDF function in PySpark?


Problem Description

I want to convert a list to a Vector in PySpark and then use that column to train a machine learning model. But my Spark version is 1.6.0, which does not have VectorUDT(). So what type should I return from my UDF?

from pyspark.sql import SQLContext
from pyspark import SparkContext, SparkConf
from pyspark.sql.functions import *
from pyspark.mllib.linalg import DenseVector
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import *


conf = SparkConf().setAppName('rank_test')
sc = SparkContext(conf=conf)
spark = SQLContext(sc)


df = spark.createDataFrame([[[0.1,0.2,0.3,0.4,0.5]]],['a'])
print '???'
df.show()
def list2vec(column):
    print '?????',column
    return Vectors.dense(column)
getVector = udf(lambda y: list2vec(y), DenseVector())  # passing DenseVector() as the return type raises the TypeError below
df.withColumn('b',getVector(col('a'))).show()

I have tried many types, and DenseVector() gives me this error:

Traceback (most recent call last):
  File "t.py", line 21, in <module>
    getVector = udf(lambda y: list2vec(y),DenseVector() )
TypeError: __init__() takes exactly 2 arguments (1 given)

Please help me.

Recommended Answer

The second argument to udf must be a SQL DataType, and DenseVector is a vector class rather than a data type (its constructor also requires a value, which is why DenseVector() by itself raises the TypeError). You can use Vectors and VectorUDT with a UDF instead:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F

ud_f = F.udf(lambda r: Vectors.dense(r), VectorUDT())
df = df.withColumn('b', ud_f('a'))
df.show(truncate=False)
+-------------------------+---------------------+
|a                        |b                    |
+-------------------------+---------------------+
|[0.1, 0.2, 0.3, 0.4, 0.5]|[0.1,0.2,0.3,0.4,0.5]|
+-------------------------+---------------------+

df.printSchema()
root
  |-- a: array (nullable = true)
  |    |-- element: double (containsNull = true)
  |-- b: vector (nullable = true)
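
The resulting vector column can then be fed to a Spark ML estimator as its features column, which is the stated goal in the question. Below is a minimal sketch, assuming a hypothetical label column that is fabricated here only so the snippet runs end to end on the toy DataFrame:

from pyspark.ml.classification import LogisticRegression
from pyspark.sql import functions as F

# 'label' is hypothetical -- it does not exist in the example DataFrame
# and is added only to make the snippet self-contained.
train_df = df.withColumn('label', F.lit(1.0))

# Column 'b' (the vector built by the UDF) is used directly as features.
lr = LogisticRegression(featuresCol='b', labelCol='label', maxIter=5)
model = lr.fit(train_df)
print(model.coefficients)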

About VectorUDT, see http://spark.apache.org/docs/2.2.0/api/python/_modules/pyspark/ml/linalg.html
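
Note that pyspark.ml.linalg only exists from Spark 2.0 onwards, while the question targets Spark 1.6.0. The same pattern should still be possible there with the mllib classes; the following is a sketch under the assumption that VectorUDT can be imported from pyspark.mllib.linalg in 1.6 (it is defined in that module, although it is less prominently documented):

# Sketch for Spark 1.6.x (no pyspark.ml.linalg package yet).
# Assumption: VectorUDT is importable from pyspark.mllib.linalg in 1.6.
from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F

list_to_vec = F.udf(lambda xs: Vectors.dense(xs), VectorUDT())
df = df.withColumn('b', list_to_vec(F.col('a')))
df.show(truncate=False)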

