How to access an element of a VectorUDT column in a Spark DataFrame?
Question
I have a dataframe df with a VectorUDT column named features. How do I get an element of that column, say the first element?
I have tried the following:
from pyspark.sql.functions import udf
first_elem_udf = udf(lambda row: row.values[0])
df.select(first_elem_udf(df.features)).show()
but I get a net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype) error. I get the same error if I use first_elem_udf = udf(lambda row: row.toArray()[0]) instead.
I also tried explode(), but I get an error because it requires an array or map type.
This should be a common operation, I think.
Answer
Convert the output to float:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf

def ith_(v, i):
    try:
        return float(v[i])
    except ValueError:
        return None

ith = udf(ith_, DoubleType())
Example usage:
from pyspark.ml.linalg import Vectors

df = sc.parallelize([
    (1, Vectors.dense([1, 2, 3])),
    (2, Vectors.sparse(3, [1], [9]))
]).toDF(["id", "features"])

df.select(ith("features", lit(1))).show()

## +-----------------+
## |ith_(features, 1)|
## +-----------------+
## |              2.0|
## |              9.0|
## +-----------------+
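Because the index is passed in as a column (a lit literal here), the same ith UDF can extract several positions in one select. A small usage sketch, reusing the df and ith defined above (the x0/x1/x2 aliases are just illustrative names):

# Each row should yield one value per requested position:
# (1.0, 2.0, 3.0) for the dense vector and (0.0, 9.0, 0.0) for the sparse one.
df.select(
    ith("features", lit(0)).alias("x0"),
    ith("features", lit(1)).alias("x1"),
    ith("features", lit(2)).alias("x2")
).show()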
Explanation:
Output values have to be reserialized to equivalent Java objects. If you want to access values (beware of SparseVectors), you should use the item method:
v.values.item(0)
which returns standard Python scalars. Similarly, if you want to access all values as a dense structure:
v.toArray().tolist()
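Tying this back to the failing UDF from the question, here is a minimal sketch that avoids the PickleException by returning a plain Python float via item() instead of a numpy scalar (assuming the features column holds DenseVectors):

from pyspark.sql.types import DoubleType
from pyspark.sql.functions import udf

# item() turns the numpy scalar into a plain Python float that Spark can
# serialize back to a Java object. Caveat: for a SparseVector, v.values holds
# only the stored (non-zero) entries, so v.values[0] is not necessarily the
# vector's element 0; use the ith_ helper above for positional access.
first_elem_udf = udf(lambda v: v.values.item(0), DoubleType())
df.select(first_elem_udf(df.features)).show()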