How to extract floats from vector columns in PySpark?
Question
My Spark DataFrame has data in the following format:
printSchema() shows that each column is of type vector.
I tried to get the values out of the [ and ] using the code below (for one column, col1):
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

firstelement = udf(lambda v: float(v[0]), FloatType())
df.select(firstelement('col1')).show()
However, how can I apply it to all columns of df?
Answer
1. Extract the first element of a single vector column:
To get the first element of a vector column, use a UDF. Here's a reproducible example:
>>> from pyspark.sql import functions as f
>>> from pyspark.sql.types import FloatType
>>> df = spark.createDataFrame([{"col1": [0.2], "col2": [0.25]},
{"col1": [0.45], "col2":[0.85]}])
>>> df.show()
+------+------+
| col1| col2|
+------+------+
| [0.2]|[0.25]|
|[0.45]|[0.85]|
+------+------+
>>> firstelement = f.udf(lambda v: float(v[0]), FloatType())
>>> df.withColumn("col1", firstelement("col1")).show()
+----+------+
|col1| col2|
+----+------+
| 0.2|[0.25]|
|0.45|[0.85]|
+----+------+
2. Extract the first element of multiple vector columns:
To generalize the above solution to all columns, apply the UDF to each column with a list comprehension inside select. Here's an example:
>>> from pyspark.sql import functions as f
>>> from pyspark.sql.types import FloatType
>>> df = spark.createDataFrame([{"col1": [0.2], "col2": [0.25]},
{"col1": [0.45], "col2":[0.85]}])
>>> df.show()
+------+------+
| col1| col2|
+------+------+
| [0.2]|[0.25]|
|[0.45]|[0.85]|
+------+------+
>>> firstelement = f.udf(lambda v: float(v[0]), FloatType())
>>> df = df.select([firstelement(c).alias(c) for c in df.columns])
>>> df.show()
+----+----+
|col1|col2|
+----+----+
| 0.2|0.25|
|0.45|0.85|
+----+----+