Parsing a JSON string PySpark dataframe column that has a string of array in one of the columns


Question

I am trying to read a JSON file and parse 'jsonString' and its underlying fields, which include an array, into a PySpark dataframe.

Here are the contents of the JSON file:

[{"jsonString": "{\"uid\":\"value1\",\"adUsername\":\"value3\",\"courseCertifications\":[{\"uid\":\"value2\",\"courseType\":\"TRAINING\"},{\"uid\":\"TEST\",\"courseType\":\"TRAINING\"}],\"modifiedBy\":\"value4\"}","transactionId": "value5", "tableName": "X"},
 {"jsonString": "{\"uid\":\"value11\",\"adUsername\":\"value13\",\"modifiedBy\":\"value14\"}","transactionId": "value15", "tableName": "X1"},
 {"jsonString": "{\"uid\":\"value21\",\"adUsername\":\"value23\",\"modifiedBy\":\"value24\"}","transactionId": "value25", "tableName": "X2"}]

I am able to parse the contents of the string 'jsonString' and select the required columns using the logic below:

from pyspark.sql.functions import array, explode, get_json_object

df = spark.read.json('path.json', multiLine=True)
# get_json_object returns the matched array as a single JSON string,
# so 'courseCertifications' ends up as a string column here
df = df.withColumn('courseCertifications', explode(array(get_json_object(df['jsonString'], '$.courseCertifications'))))

Now my end goal is to parse the field "courseType" from "courseCertifications" and create one row per instance.

I am using the logic below to get "courseType":

df = df.withColumn('new',get_json_object(df.courseCertifications, '$[*].courseType'))

I am able to get the contents of "courseType", but only as a string, as shown below:

[Row(new=u'["TRAINING","TRAINING"]')]

My end goal is to create a dataframe with the columns transactionId, jsonString.uid, jsonString.adUsername, jsonString.courseCertifications.uid, and jsonString.courseCertifications.courseType.

  • I need to retain all rows and create multiple rows, one for each array element of courseCertifications.uid/courseCertifications.courseType.

Answer

An elegant way to resolve this is to define the schema of the JSON string and then parse it with the from_json function:

import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, StructType, StructField

spark = SparkSession.builder.getOrCreate()

df = spark.read.json('your_path', multiLine=True)

# Schema of the JSON document embedded in the 'jsonString' column
schema = StructType([
    StructField('uid', StringType()),
    StructField('adUsername', StringType()),
    StructField('modifiedBy', StringType()),
    StructField('courseCertifications', ArrayType(
        StructType([
            StructField('uid', StringType()),
            StructField('courseType', StringType())
        ])
    ))
])

# Parse the embedded JSON, promote its fields to top-level columns,
# then explode courseCertifications into one row per array element
df = df \
    .withColumn('tmp', f.from_json(df.jsonString, schema)) \
    .withColumn('adUsername', f.col('tmp').adUsername) \
    .withColumn('uid', f.col('tmp').uid) \
    .withColumn('modifiedBy', f.col('tmp').modifiedBy) \
    .withColumn('tmp', f.explode(f.col('tmp').courseCertifications)) \
    .withColumn('course_uid', f.col('tmp').uid) \
    .withColumn('course_type', f.col('tmp').courseType) \
    .drop('jsonString', 'tmp')
df.show()

Output:

+-------------+------+----------+----------+----------+-----------+
|transactionId|uid   |adUsername|modifiedBy|course_uid|course_type|
+-------------+------+----------+----------+----------+-----------+
|value5       |value1|value3    |value4    |value2    |TRAINING   |
|value5       |value1|value3    |value4    |TEST      |TRAINING   |
+-------------+------+----------+----------+----------+-----------+
