Python Spark DataFrame: replace null with SparseVector


Problem Description

In Spark, I have the following DataFrame called "df" with some null entries:

+-------+--------------------+--------------------+                     
|     id|           features1|           features2|
+-------+--------------------+--------------------+
|    185|(5,[0,1,4],[0.1,0...|                null|
|    220|(5,[0,2,3],[0.1,0...|(10,[1,2,6],[0.1,...|
|    225|                null|(10,[1,3,5],[0.1,...|
+-------+--------------------+--------------------+

df.features1 and df.features2 are of type vector (nullable). I then tried to fill the null entries with SparseVectors using the following code:

df1 = df.na.fill({"features1":SparseVector(5,{}), "features2":SparseVector(10, {})})

This code results in the following error:

AttributeError: 'SparseVector' object has no attribute '_get_object_id'

Then I found the following paragraph in the Spark documentation:

fillna(value, subset=None)
Replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other.

Parameters: 
value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, or string.

Does this explain my failure to replace null entries with SparseVectors in the DataFrame? Or does it mean there is no way to do this in a DataFrame?

I can achieve my goal by converting the DataFrame to an RDD and replacing the None values with SparseVectors, but it would be much more convenient to do this directly in the DataFrame.

Is there any method to do this directly in the DataFrame? Thanks!

Recommended Answer

You can use a udf:

from pyspark.sql.functions import udf, lit
from pyspark.ml.linalg import SparseVector, VectorUDT

# Pass each value through unchanged, or replace null with an empty
# SparseVector of dimension i.
fill_with_vector = udf(
    lambda x, i: x if x is not None else SparseVector(i, {}),
    VectorUDT()
)

df = sc.parallelize([
    (SparseVector(5, {1: 1.0}), SparseVector(10, {1: -1.0})), (None, None)
]).toDF(["features1", "features2"])

(df
    .withColumn("features1", fill_with_vector("features1", lit(5)))
    .withColumn("features2", fill_with_vector("features2", lit(10)))
    .show())

# +-------------+---------------+
# |    features1|      features2|
# +-------------+---------------+
# |(5,[1],[1.0])|(10,[1],[-1.0])|
# |    (5,[],[])|     (10,[],[])|
# +-------------+---------------+

