PySpark: How to create a nested JSON from a Spark data frame?


Problem description


I am trying to create a nested JSON from my Spark dataframe, which has data in the following structure. The code below creates a simple JSON with only keys and values. Could you please help?

df.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)

Update 1: As per @MaxU's answer, I converted the Spark data frame to pandas and used a group by. It puts only the last two fields into a nested array. How can I first put the category and count into a nested array, and then put the subcategory and count inside that array?

Sample text data:

Vendor_Name,count,Categories,Category_Count,Subcategory,Subcategory_Count
Vendor1,10,Category 1,4,Sub Category 1,1
Vendor1,10,Category 1,4,Sub Category 2,2
Vendor1,10,Category 1,4,Sub Category 3,3
Vendor1,10,Category 1,4,Sub Category 4,4

j = (data_pd.groupby(['vendor_name','vendor_Cnt','Category','Category_cnt'], as_index=False)
             .apply(lambda x: x[['Subcategory','subcategory_cnt']].to_dict('r'))
             .reset_index()
             .rename(columns={0:'subcategories'})
             .to_json(orient='records'))
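For reference, a minimal sketch (not from the original post) of the Spark-to-pandas conversion step this update refers to; note that the column names in the snippet above differ from the sample header, so they are assumed to match whatever the actual dataframe contains:

# Sketch only: convert the Spark dataframe to pandas before grouping.
# "df" is the Spark dataframe from the question; toPandas() collects it
# to the driver, so this assumes the data fits in driver memory.
data_pd = df.toPandas()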

[{
        "vendor_name": "Vendor 1",
        "count": 10,
        "categories": [{
            "name": "Category 1",
            "count": 4,
            "subCategories": [{
                    "name": "Sub Category 1",
                    "count": 1
                },
                {
                    "name": "Sub Category 2",
                    "count": 1
                },
                {
                    "name": "Sub Category 3",
                    "count": 1
                },
                {
                    "name": "Sub Category 4",
                    "count": 1
                }
            ]
        }]
    }
]

Solution

The easiest way to do this in Python/pandas would be, I think, to use a series of nested generators with groupby:

def split_df(df):
    # One dict per (vendor, count) pair, with its categories nested inside.
    for (vendor, count), df_vendor in df.groupby(["Vendor_Name", "count"]):
        yield {
            "vendor_name": vendor,
            "count": count,
            "categories": list(split_category(df_vendor))
        }

def split_category(df_vendor):
    # One dict per (category, category count) pair within the vendor group.
    for (category, count), df_category in df_vendor.groupby(
        ["Categories", "Category_Count"]
    ):
        yield {
            "name": category,
            "count": count,
            "subCategories": list(split_subcategory(df_category)),
        }

def split_subcategory(df_category):
    # One dict per subcategory row within the current category group.
    for row in df_category.itertuples():
        yield {"name": row.Subcategory, "count": row.Subcategory_Count}

list(split_df(df))

[
    {
        "vendor_name": "Vendor1",
        "count": 10,
        "categories": [
            {
                "name": "Category 1",
                "count": 4,
                "subCategories": [
                    {"name": "Sub Category 1", "count": 1},
                    {"name": "Sub Category 2", "count": 2},
                    {"name": "Sub Category 3", "count": 3},
                    {"name": "Sub Category 4", "count": 4},
                ],
            }
        ],
    }
]
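
For reference, a minimal, self-contained way to try these generators on the sample data above (the io.StringIO wrapper and the csv_text variable are only for illustration; in practice the frame would come from df.toPandas()):

import io
import pandas as pd

# Illustrative reconstruction of the sample text data shown in the question.
csv_text = """Vendor_Name,count,Categories,Category_Count,Subcategory,Subcategory_Count
Vendor1,10,Category 1,4,Sub Category 1,1
Vendor1,10,Category 1,4,Sub Category 2,2
Vendor1,10,Category 1,4,Sub Category 3,3
Vendor1,10,Category 1,4,Sub Category 4,4"""

df = pd.read_csv(io.StringIO(csv_text))
nested = list(split_df(df))  # produces the structure shown above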

To export this to JSON, you will also need a way to serialize the np.int64 values that pandas produces, since the standard json module does not handle NumPy scalar types by default.
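
One common approach (a sketch, not part of the original answer; the NpEncoder name is just illustrative) is a small json.JSONEncoder subclass that downcasts NumPy scalars before writing the file:

import json
import numpy as np

class NpEncoder(json.JSONEncoder):
    # Fall back to plain Python types for NumPy scalars; everything else
    # is handled by the default encoder.
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        return super().default(obj)

with open("createjson.json", "w") as f:
    json.dump(list(split_df(df)), f, cls=NpEncoder, indent=4)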
