How to access an element of a VectorUDT column in a Spark DataFrame?
Question
I have a dataframe df with a VectorUDT column named features. How do I get an element of that column, say the first element?
I have tried the following:
from pyspark.sql.functions import udf
first_elem_udf = udf(lambda row: row.values[0])
df.select(first_elem_udf(df.features)).show()
but I get a net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype) error. I get the same error if I use first_elem_udf = udf(lambda row: row.toArray()[0]) instead.
I also tried explode(), but I get an error because it requires an array or map type.
This should be a common operation, I think.
Answer
Convert the output to float:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf

def ith_(v, i):
    try:
        return float(v[i])
    except ValueError:
        return None

ith = udf(ith_, DoubleType())
Example usage:
from pyspark.ml.linalg import Vectors

df = sc.parallelize([
    (1, Vectors.dense([1, 2, 3])),
    (2, Vectors.sparse(3, [1], [9]))
]).toDF(["id", "features"])

df.select(ith("features", lit(1))).show()

## +-----------------+
## |ith_(features, 1)|
## +-----------------+
## |              2.0|
## |              9.0|
## +-----------------+
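Because the index is passed in as a column (a lit literal here), the same ith UDF can extract several positions in one select. A small usage sketch, reusing the df and ith defined above (the x0/x1/x2 aliases are just illustrative names):

# Each row should yield one value per requested position:
# (1.0, 2.0, 3.0) for the dense vector and (0.0, 9.0, 0.0) for the sparse one.
df.select(
    ith("features", lit(0)).alias("x0"),
    ith("features", lit(1)).alias("x1"),
    ith("features", lit(2)).alias("x2")
).show()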
Explanation:
Output values have to be reserialized to equivalent Java objects. If you want to access values (beware of SparseVectors), you should use the item method:
v.values.item(0)
which returns standard Python scalars. Similarly, if you want to access all values as a dense structure:
v.toArray().tolist()
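Tying this back to the failing UDF from the question, here is a minimal sketch that avoids the PickleException by returning a plain Python float via item() instead of a numpy scalar (assuming the features column holds DenseVectors):

from pyspark.sql.types import DoubleType
from pyspark.sql.functions import udf

# item() turns the numpy scalar into a plain Python float that Spark can
# serialize back to a Java object. Caveat: for a SparseVector, v.values holds
# only the stored (non-zero) entries, so v.values[0] is not necessarily the
# vector's element 0; use the ith_ helper above for positional access.
first_elem_udf = udf(lambda v: v.values.item(0), DoubleType())
df.select(first_elem_udf(df.features)).show()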