规范化复杂的嵌套 JSON 文件 [英] Normalize a complex nested JSON file

查看：21 发布时间：2021/11/14 22:13:34 python pandas apache-spark pyspark apache-spark-sql

本文介绍了规范化复杂的嵌套 JSON 文件的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在尝试将下面的 json 文件规范化为 4 个表 - 内容"、模块"、图像"和另一个表中的所有其他内容"

Im trying to normalize the below json file into 4 tables - "content", "Modules", "Images" and "Everything Else in another table"

{
    "id": "0000050a",
    "revision": 1580225050941,
    "slot": "product-description",
    "type": "E",
    "create_date": 1580225050941,
    "modified_date": 1580225050941,
    "creator": "Auto",
    "modifier": "Auto",
    "audit_info": {
        "date": 1580225050941,
        "source": "AutoService",
        "username": "Auto"
    },
    "total_ID": 1,
    "name": "Auto_A1AM78C64UM0Y8_B07JCJR5HW",
    "content": [{
        "ID": ["B01"],
        "content_revision": 1580225050941,
        "template": {
            "module": [{
                "id": "module-11",
                "text": null,
                "header": [{
                    "id": "title",
                    "value": null,
                    "decorators": []
                }],
                "paragraph": [{
                    "id": "description",
                    "value": [],
                    "decorators": []
                }],
                "image": [{
                    "id": "image",
                    "assetId": "/images/2cdabb786d10.jpg",
                    "alt": " ",
                    "viewLarger": null,
                    "styleCodes": "__CR0,0,970,300_PT0_SX970_V1__",
                    "src": "image.jpg",
                    "originalSrc": null
                }],
                "integer": null,
                "chart": null,
                "list": null,
                "video": null,
                "gallery": null,
                "composite": null,
                "collection": null,
                "product": null
            }, {
                "id": "module-6",
                "text": null,
                "header": [{
                    "id": "title1",
                    "value": "Dest ",
                    "decorators": []
                }, {
                    "id": "title2",
                    "value": "cc",
                    "decorators": []
                }, {
                    "id": "title3",
                    "value": "Col",
                    "decorators": []
                }, {
                    "id": "title4",
                    "value": "C",
                    "decorators": []
                }, {
                    "id": "caption1",
                    "value": null,
                    "decorators": []
                }, {
                    "id": "caption2",
                    "value": null,
                    "decorators": []
                }, {
                    "id": "caption3",
                    "value": null,
                    "decorators": []
                }, {
                    "id": "caption4",
                    "value": null,
                    "decorators": []
                }],
                "paragraph": [{
                    "id": "description1",
                    "value": [" Sport"],
                    "decorators": [
                        []
                    ]
                }, {
                    "id": "description2",
                    "value": ["elements "],
                    "decorators": [
                        []
                    ]
                }, {
                    "id": "description3",
                    "value": ["Film "],
                    "decorators": [
                        []
                    ]
                }, {
                    "id": "description4",
                    "value": ["Our signature "],
                    "decorators": [
                        []
                    ]
                }],
                "image": [{
                    "id": "image1",
                    "assetId": "/images/dbbfc9873e31.jpg",
                    "alt": " ",
                    "viewLarger": null,
                    "styleCodes": "__CR0,0,300,300_PT0_SX300_V1__",
                    "src": "image2_.jpg",
                    "originalSrc": null
                }, {
                    "id": "image2",
                    "assetId": "/images/f577ae005.jpg",
                    "alt": " ",
                    "viewLarger": null,
                    "styleCodes": "__CR0,0,300,300_PT0_SX300_V1__",
                    "src": "test.jpg",
                    "originalSrc": null
                }, {
                    "id": "image3",
                    "assetId": "/images/-0df21c5216d0.jpg",
                    "alt": " ",
                    "viewLarger": null,
                    "styleCodes": "__CR0,0,300,300_PT0_SX300_V1__",
                    "src": "image.jpg",
                    "originalSrc": null
                }, {
                    "id": "image4",
                    "assetId": "/images/78d26b9c-408c-4299-8ea8-e9257f170320.jpg",
                    "alt": " ",
                    "viewLarger": null,
                    "styleCodes": "__CR0,0,300,300_PT0_SX300_V1__",
                    "src": "image.jpg",
                    "originalSrc": null
                }, {
                    "id": "thumb1",
                    "assetId": "/images/-bbbfc9873e31.jpg",
                    "alt": " ",
                    "viewLarger": null,
                    "styleCodes": "__CR0,0,300,300_PT0_SX300_V1__",
                    "src": "image.jpg",
                    "originalSrc": null
                }, {
                    "id": "thumb2",
                    "assetId": "/images/e56f577ae005.jpg",
                    "alt": " ",
                    "viewLarger": null,
                    "styleCodes": "__CR0,0,300,300_PT0_SX300_V1__",
                    "src": "image_.jpg",
                    "originalSrc": null
                }, {
                    "id": "thumb3",
                    "assetId": "/images/0df21c5216d0.jpg",
                    "alt": " ",
                    "viewLarger": null,
                    "styleCodes": "__CR0,0,300,300_PT0_SX300_V1__",
                    "src": "image_.jpg",
                    "originalSrc": null
                }, {
                    "id": "thumb4",
                    "assetId": "/images/-e9257f170320.jpg",
                    "alt": " ",
                    "viewLarger": null,
                    "styleCodes": "__CR0,0,300,300_PT0_SX300_V1__",
                    "src": "image.jpg",
                    "originalSrc": null
                }],
                "integer": null,
                "chart": null,
                "list": null,
                "video": null,
                "gallery": null,
                "composite": null,
                "collection": null,
                "product": null
            }],
            "renderType": "VERTICAL"
        },
        "locale_data": {
            "locale": "en_US",
            "identified_by": "MACHINE_DETECT"
        }
    }],
    "badges": []
}

我可以成功地将 JSON 扁平化为一个大数据框，以标头作为 JSON 路径.但我想将 JSON 规范化为单独的表.例如，模块表应该有 ID、Text、Header_ID、Header_Value 等列.Image 表应该有Image_ID、Assest_ID、Src 等列.任何人都可以帮我将此 JSON 规范化为 4 个表.

I can successfully flatten the JSON into one big data frame with the header as the JSON path. but I want to normalize JSON into separate tables. Like for example the Module table should have columns like ID, Text, Header_ID, Header_Value and so on. The Image table should have columns like Image_ID, Assest_ID, Src and so on. Can anyone please help me normalize this JSON into the 4 tables.

推荐答案

您可以使用 https://towardsdatascience.com/flattening-json-objects-in-python-f5343c794b10，如下，然后使用json_normalize:

You could use the function defined by https://towardsdatascience.com/flattening-json-objects-in-python-f5343c794b10, as follows, and then use json_normalize :

import pandas as pd
import json
with open('test.json') as json_file:
    data = json.load(json_file)

def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

module =  flatten_json(data["content"][0])
module = pd.json_normalize(module)

然后，您要做的就是根据您描述的四个类别选择列.输出为:

Then, what you have to do is select the columns according to the four categories you described. The output is:

ID_0  content_revision  ... locale_data_locale locale_data_identified_by
0  B01     1580225050941  ...              en_US            MACHINE_DETECT

然后您选择如下，例如为您的模块和图像数据帧:

Then you select as follows, for instance for your module and image DataFrames:

module = df.loc[:,df.columns.str.contains("module")]
image = df.loc[:,df.columns.str.contains("image")]

例如你得到的模块结果是:

The result you get for module for instance is :

template_module_0_id  ... template_module_1_product
0            module-11  ...                      None

那么，我举一个DataFrame模块转换的例子，你只有两个模块，所以你可以在重命名列后做一个concat:

Then, I give the example for the transformation of the module DataFrame, you only have two modules so you can do a concat after renaming the columns:

module1 = module.loc[:,module.columns.str.contains("module_0")]
module1.columns = module1.columns.str.replace("_0","")
module2 = module.loc[:,module.columns.str.contains("module_1")]
module2.columns = module2.columns.str.replace("_1","")
modules = pd.concat([module1,
                     module2])

你会得到:

 template_module_id  ... template_module_image_7_originalSrc
0          module-11  ...                                 NaN
0           module-6  ...                                None

如果您有更多元素，另一种选择是直接在您想要的嵌套元素上使用 flatten_json 和 json_normalize 函数.

The other option if you had a lot more elements would be to use the flatten_json and json_normalize functions directly on the nested element you want.

这篇关于规范化复杂的嵌套 JSON 文件的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

规范化复杂的嵌套 JSON 文件 [英] Normalize a complex nested JSON file

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

规范化复杂的嵌套 JSON 文件 [英] Normalize a complex nested JSON file

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭