pyspark convert row to json with nulls


Question

Goal: For a dataframe with schema

id:string
Cold:string
Medium:string
Hot:string
IsNull:string
annual_sales_c:string
average_check_c:string
credit_rating_c:string
cuisine_c:string
dayparts_c:string
location_name_c:string
market_category_c:string
market_segment_list_c:string
menu_items_c:string
msa_name_c:string
name:string
number_of_employees_c:string
number_of_rooms_c:string
Months In Role:integer
Tenured Status:string
IsCustomer:integer
units_c:string
years_in_business_c:string
medium_interactions_c:string
hot_interactions_c:string
cold_interactions_c:string
is_null_interactions_c:string

I want to add a new column that is a JSON string of all keys and values for the columns. I have used the approach in this post PySpark - Convert to JSON row by row and related questions. My code

import pyspark.sql.functions as func  # missing import added

df = df.withColumn("JSON", func.to_json(func.struct([df[x] for x in small_df.columns])))

Issue: When any row has a null value for a column (and my data has many...) the Json string doesn't contain the key. I.e. if only 9 out of the 27 columns have values then the JSON string only has 9 keys... What I would like to do is maintain all keys but for the null values just pass an empty string ""

Any hints?

Answer

You should be able to just modify the answer on the question you linked using pyspark.sql.functions.when.

Consider the following example DataFrame:

data = [
    ('one', 1, 10),
    (None, 2, 20),
    ('three', None, 30),
    (None, None, 40)
]

sdf = spark.createDataFrame(data, ["A", "B", "C"])
sdf.printSchema()
#root
# |-- A: string (nullable = true)
# |-- B: long (nullable = true)
# |-- C: long (nullable = true)

Use when to implement the if-then-else logic: use the column if it is not null, otherwise return an empty string.

from pyspark.sql.functions import col, to_json, struct, when, lit
sdf = sdf.withColumn(
    "JSON",
    to_json(
        struct(
           [
                when(
                    col(x).isNotNull(),
                    col(x)
                ).otherwise(lit("")).alias(x) 
                for x in sdf.columns
            ]
        )
    )
)
sdf.show()
#+-----+----+---+-----------------------------+
#|A    |B   |C  |JSON                         |
#+-----+----+---+-----------------------------+
#|one  |1   |10 |{"A":"one","B":"1","C":"10"} |
#|null |2   |20 |{"A":"","B":"2","C":"20"}    |
#|three|null|30 |{"A":"three","B":"","C":"30"}|
#|null |null|40 |{"A":"","B":"","C":"40"}     |
#+-----+----+---+-----------------------------+


Another option is to use coalesce.
