pyspark convert row to json with nulls

Question

Goal: for a DataFrame with the schema

id:string
Cold:string
Medium:string
Hot:string
IsNull:string
annual_sales_c:string
average_check_c:string
credit_rating_c:string
cuisine_c:string
dayparts_c:string
location_name_c:string
market_category_c:string
market_segment_list_c:string
menu_items_c:string
msa_name_c:string
name:string
number_of_employees_c:string
number_of_rooms_c:string
Months In Role:integer
Tenured Status:string
IsCustomer:integer
units_c:string
years_in_business_c:string
medium_interactions_c:string
hot_interactions_c:string
cold_interactions_c:string
is_null_interactions_c:string

I want to add a new column that is a JSON string of all keys and values for the columns. I used the approach from the post "PySpark - Convert to JSON row by row" and related questions. My code:

df = df.withColumn("JSON",func.to_json(func.struct([df[x] for x in small_df.columns])))

I am running into one issue:

Issue: when a row has a null value in a column (and my data has many), the JSON string does not contain that key. For example, if only 9 of the 27 columns have values, the JSON string has only 9 keys. I would like to keep all keys and pass an empty string "" for the null values.

Any suggestions?

Recommended answer

You should be able to modify the answer to the question you linked by using pyspark.sql.functions.when.

Consider the following example DataFrame:

data = [
    ('one', 1, 10),
    (None, 2, 20),
    ('three', None, 30),
    (None, None, 40)
]

sdf = spark.createDataFrame(data, ["A", "B", "C"])
sdf.printSchema()
#root
# |-- A: string (nullable = true)
# |-- B: long (nullable = true)
# |-- C: long (nullable = true)

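Use when to implement if-then-else logic: use the column if it is not null, otherwise return an empty string. Because the fallback is a string literal, Spark casts the non-string columns to string, which is why the values of B and C appear as quoted strings in the output below.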

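from pyspark.sql.functions import col, to_json, struct, when, lit

# Build the struct from every column, replacing nulls with "" so each key is kept.
# The result is not assigned back to sdf, so sdf.columns stays ["A", "B", "C"]
# and the coalesce variant further down does not pick up the JSON column.
sdf.withColumn(
    "JSON",
    to_json(
        struct(
            [
                when(
                    col(x).isNotNull(),
                    col(x)
                ).otherwise(lit("")).alias(x)
                for x in sdf.columns
            ]
        )
    )
).show(truncate=False)  # truncate=False keeps the full JSON string visible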
#+-----+----+---+-----------------------------+
#|A    |B   |C  |JSON                         |
#+-----+----+---+-----------------------------+
#|one  |1   |10 |{"A":"one","B":"1","C":"10"} |
#|null |2   |20 |{"A":"","B":"2","C":"20"}    |
#|three|null|30 |{"A":"three","B":"","C":"30"}|
#|null |null|40 |{"A":"","B":"","C":"40"}     |
#+-----+----+---+-----------------------------+
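
As a rough illustration of what the expression above computes for each row, here is a plain-Python sketch that needs no Spark session. It is not Spark code: the helper name row_to_json and the dict-based rows are assumptions made for this example. It only mirrors the per-row rule of keeping every key, replacing null with "", and rendering the remaining values as strings.

```python
import json


def row_to_json(row, columns):
    """Plain-Python analogue of the when/otherwise struct for a single row."""
    # Keep every key; None becomes "", everything else is rendered as a string,
    # mirroring the cast to string seen in the Spark output above.
    filled = {c: "" if row.get(c) is None else str(row[c]) for c in columns}
    # Compact separators match the spacing of Spark's to_json output.
    return json.dumps(filled, separators=(",", ":"))


columns = ["A", "B", "C"]
rows = [
    {"A": "one", "B": 1, "C": 10},
    {"A": None, "B": 2, "C": 20},
    {"A": "three", "B": None, "C": 30},
    {"A": None, "B": None, "C": 40},
]

json_rows = [row_to_json(r, columns) for r in rows]
for line in json_rows:
    print(line)
```

Every printed line contains all three keys, which is the property the question asks for. If numeric types must survive in the JSON, an empty-string fallback is not suitable, since it forces every value to a string.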

Another option is to use pyspark.sql.functions.coalesce instead of when:

from pyspark.sql.functions import coalesce

sdf.withColumn(
    "JSON",
    to_json(
        struct(
           [coalesce(col(x), lit("")).alias(x) for x in sdf.columns]
        )
    )
).show(truncate=False)
## Same as above
