How to extract floats from vector columns in PySpark?


Question

My Spark DataFrame has data in the following format:

printSchema() shows that each column is of type vector.

I tried to get the values out of the [ and ] brackets using the code below (for one column, col1):

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

firstelement = udf(lambda v: float(v[0]), FloatType())
df.select(firstelement('col1')).show()

However, how can I apply it to all columns of df?

Answer

1. Extract the first element of a single vector column:

To get the first element of a vector column, you can use the approach from this SO discussion. Here's a reproducible example:

>>> from pyspark.sql import functions as f
>>> from pyspark.sql.types import FloatType
>>> df = spark.createDataFrame([{"col1": [0.2], "col2": [0.25]},
                                {"col1": [0.45], "col2":[0.85]}])
>>> df.show()
+------+------+
|  col1|  col2|
+------+------+
| [0.2]|[0.25]|
|[0.45]|[0.85]|
+------+------+

>>> firstelement = f.udf(lambda v: float(v[0]), FloatType())
>>> df.withColumn("col1", firstelement("col1")).show()
+----+------+
|col1|  col2|
+----+------+
| 0.2|[0.25]|
|0.45|[0.85]|
+----+------+


2. Extract the first element of multiple vector columns:

To generalize the solution above to multiple columns, apply the UDF to each column in a list comprehension inside select. Here's an example:

>>> from pyspark.sql import functions as f
>>> from pyspark.sql.types import FloatType

>>> df = spark.createDataFrame([{"col1": [0.2], "col2": [0.25]},
                                {"col1": [0.45], "col2":[0.85]}])
>>> df.show()
+------+------+
|  col1|  col2|
+------+------+
| [0.2]|[0.25]|
|[0.45]|[0.85]|
+------+------+

>>> firstelement = f.udf(lambda v: float(v[0]), FloatType())
>>> df = df.select([firstelement(c).alias(c) for c in df.columns])
>>> df.show()
+----+----+
|col1|col2|
+----+----+
| 0.2|0.25|
|0.45|0.85|
+----+----+
