Convert nested JSON to CSV file in Python

Problem Description


I know this question has been asked many times. I tried several solutions but I couldn't solve my problem.

I have a large nested JSON file (1.4GB) and I would like to make it flat and then convert it to a CSV file.

The JSON structure is like this:

{
  "company_number": "12345678",
  "data": {
    "address": {
      "address_line_1": "Address 1",
      "locality": "Henley-On-Thames",
      "postal_code": "RG9 1DP",
      "premises": "161",
      "region": "Oxfordshire"
    },
    "country_of_residence": "England",
    "date_of_birth": {
      "month": 2,
      "year": 1977
    },
    "etag": "26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00",
    "kind": "individual-person-with-significant-control",
    "links": {
      "self": "/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl"
    },
    "name": "John M Smith",
    "name_elements": {
      "forename": "John",
      "middle_name": "M",
      "surname": "Smith",
      "title": "Mrs"
    },
    "nationality": "Vietnamese",
    "natures_of_control": [
      "ownership-of-shares-50-to-75-percent"
    ],
    "notified_on": "2016-04-06"
  }
}

I know that this is easy to accomplish with pandas module but I am not familiar with it.

EDITED

The desired output should be something like this:

company_number, address_line_1, locality, country_of_residence, kind,

12345678, Address 1, Henley-On-Thames, England, individual-person-with-significant-control

Note that this is just the short version. The output should have all the fields.

Solution

Please scroll down for the newer, faster solution

This is an older question, but I struggled the entire night to get a satisfactory result for a similar situation, and I came up with this:

import json
import pandas

def cross_join(left, right):
    return left.assign(key=1).merge(right.assign(key=1), on='key', how='outer').drop(columns='key')

def json_to_dataframe(data_in):
    def to_frame(data, prev_key=''):
        if isinstance(data, dict):
            df = pandas.DataFrame()
            for key in data:
                df = cross_join(df, to_frame(data[key], prev_key + '.' + key))
        elif isinstance(data, list):
            df = pandas.DataFrame()
            for i in range(len(data)):
                df = pandas.concat([df, to_frame(data[i], prev_key)])
        else:
            # leaf value: the column name is the accumulated key path minus the leading '.'
            df = pandas.DataFrame({prev_key[1:]: [data]})
        return df
    return to_frame(data_in)

if __name__ == '__main__':
    with open('somefile') as json_file:
        json_data = json.load(json_file)

    df = json_to_dataframe(json_data)
    df.to_csv('data.csv', mode='w')

Explanation:

The cross_join function is a neat way I found to do a cartesian product. (credit: here)
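For illustration, here is a minimal, self-contained sketch of the trick (toy column names, not from the original data):

```python
import pandas

# Sketch of the cross_join trick: a constant dummy 'key' column makes an
# ordinary merge behave like a cartesian product of the two row sets.
def cross_join(left, right):
    return (left.assign(key=1)
                .merge(right.assign(key=1), on='key')
                .drop(columns='key'))

letters = pandas.DataFrame({'letter': ['a', 'b']})
numbers = pandas.DataFrame({'number': [1, 2, 3]})
print(cross_join(letters, numbers))  # 6 rows: every letter with every number
```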

The json_to_dataframe function does the logic, using pandas dataframes. In my case the JSON was deeply nested, and I wanted to split dictionary key:value pairs into columns, but to turn lists into rows for a column (hence the concat). Each such frame is then cross-joined with the level above, multiplying the record count so that every value from the list gets its own row while the preceding columns stay identical.

The recursion builds a stack of frames, each cross-joined with the one below it, until the last one is returned.

Then, with the dataframe in tabular form, it is easy to convert to CSV with the df.to_csv() dataframe method.
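As a small illustration (toy columns, and writing to an in-memory buffer rather than a file), passing index=False drops the row-number column that a plain df.to_csv('data.csv') call would otherwise include:

```python
import io
import pandas

df = pandas.DataFrame({'company_number': ['12345678'],
                       'locality': ['Henley-On-Thames']})
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False: no row-number column
print(buffer.getvalue())        # two lines: the header, then the data row
```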

This should work with deeply nested JSON, being able to normalize all of it into rows by the logic described above.
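To see the list-to-rows behaviour on a toy input, here is a condensed sketch of the same recursion (two small departures from the code above: prev_key defaults to an empty string, and the first dict iteration is guarded so nothing is ever merged with an empty frame):

```python
import pandas

def cross_join(left, right):
    # dummy constant 'key' column turns merge() into a cartesian product
    return (left.assign(key=1)
                .merge(right.assign(key=1), on='key')
                .drop(columns='key'))

def json_to_dataframe(data_in):
    def to_frame(data, prev_key=''):
        if isinstance(data, dict):
            df = pandas.DataFrame()
            for key in data:
                sub = to_frame(data[key], prev_key + '.' + key)
                # guard the first iteration: never merge with an empty frame
                df = sub if df.empty else cross_join(df, sub)
        elif isinstance(data, list):
            frames = [to_frame(item, prev_key) for item in data]
            df = pandas.concat(frames) if frames else pandas.DataFrame()
        else:
            df = pandas.DataFrame({prev_key[1:]: [data]})
        return df
    return to_frame(data_in)

df = json_to_dataframe({'id': 1, 'tags': ['a', 'b']})
print(df)  # two rows, one per tag, with the same 'id' repeated on each
```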

I hope this will help someone, someday. Just trying to give back to this awesome community.

---------------------------------------------------------------------------------------------

LATER EDIT: NEW SOLUTION

I'm coming back to this because, while the dataframe option kind of worked, the app took minutes to parse JSON data that was not all that large. So I decided to do what the dataframes were doing, but myself:

from copy import deepcopy
import pandas


def cross_join(left, right):
    # an empty right-hand side leaves the left rows unchanged
    new_rows = [] if right else left
    for left_row in left:
        for right_row in right:
            temp_row = deepcopy(left_row)
            for key, value in right_row.items():
                temp_row[key] = value
            new_rows.append(deepcopy(temp_row))
    return new_rows


def flatten_list(data):
    for elem in data:
        if isinstance(elem, list):
            yield from flatten_list(elem)
        else:
            yield elem


def json_to_dataframe(data_in):
    def flatten_json(data, prev_heading=''):
        if isinstance(data, dict):
            rows = [{}]
            for key, value in data.items():
                rows = cross_join(rows, flatten_json(value, prev_heading + '.' + key))
        elif isinstance(data, list):
            rows = []
            for item in data:
                rows.extend(flatten_list(flatten_json(item, prev_heading)))
        else:
            rows = [{prev_heading[1:]: data}]
        return rows

    return pandas.DataFrame(flatten_json(data_in))


if __name__ == '__main__':
    json_data = {
        "id": "0001",
        "type": "donut",
        "name": "Cake",
        "ppu": 0.55,
        "batters":
            {
                "batter":
                    [
                        {"id": "1001", "type": "Regular"},
                        {"id": "1002", "type": "Chocolate"},
                        {"id": "1003", "type": "Blueberry"},
                        {"id": "1004", "type": "Devil's Food"}
                    ]
            },
        "topping":
            [
                {"id": "5001", "type": "None"},
                {"id": "5002", "type": "Glazed"},
                {"id": "5005", "type": "Sugar"},
                {"id": "5007", "type": "Powdered Sugar"},
                {"id": "5006", "type": "Chocolate with Sprinkles"},
                {"id": "5003", "type": "Chocolate"},
                {"id": "5004", "type": "Maple"}
            ],
        "something": []
    }
    df = json_to_dataframe(json_data)
    print(df)

OUTPUT:

      id   type  name   ppu batters.batter.id batters.batter.type topping.id              topping.type
0   0001  donut  Cake  0.55              1001             Regular       5001                      None
1   0001  donut  Cake  0.55              1001             Regular       5002                    Glazed
2   0001  donut  Cake  0.55              1001             Regular       5005                     Sugar
3   0001  donut  Cake  0.55              1001             Regular       5007            Powdered Sugar
4   0001  donut  Cake  0.55              1001             Regular       5006  Chocolate with Sprinkles
5   0001  donut  Cake  0.55              1001             Regular       5003                 Chocolate
6   0001  donut  Cake  0.55              1001             Regular       5004                     Maple
7   0001  donut  Cake  0.55              1002           Chocolate       5001                      None
8   0001  donut  Cake  0.55              1002           Chocolate       5002                    Glazed
9   0001  donut  Cake  0.55              1002           Chocolate       5005                     Sugar
10  0001  donut  Cake  0.55              1002           Chocolate       5007            Powdered Sugar
11  0001  donut  Cake  0.55              1002           Chocolate       5006  Chocolate with Sprinkles
12  0001  donut  Cake  0.55              1002           Chocolate       5003                 Chocolate
13  0001  donut  Cake  0.55              1002           Chocolate       5004                     Maple
14  0001  donut  Cake  0.55              1003           Blueberry       5001                      None
15  0001  donut  Cake  0.55              1003           Blueberry       5002                    Glazed
16  0001  donut  Cake  0.55              1003           Blueberry       5005                     Sugar
17  0001  donut  Cake  0.55              1003           Blueberry       5007            Powdered Sugar
18  0001  donut  Cake  0.55              1003           Blueberry       5006  Chocolate with Sprinkles
19  0001  donut  Cake  0.55              1003           Blueberry       5003                 Chocolate
20  0001  donut  Cake  0.55              1003           Blueberry       5004                     Maple
21  0001  donut  Cake  0.55              1004        Devil's Food       5001                      None
22  0001  donut  Cake  0.55              1004        Devil's Food       5002                    Glazed
23  0001  donut  Cake  0.55              1004        Devil's Food       5005                     Sugar
24  0001  donut  Cake  0.55              1004        Devil's Food       5007            Powdered Sugar
25  0001  donut  Cake  0.55              1004        Devil's Food       5006  Chocolate with Sprinkles
26  0001  donut  Cake  0.55              1004        Devil's Food       5003                 Chocolate
27  0001  donut  Cake  0.55              1004        Devil's Food       5004                     Maple

As for what the above does: the cross_join function is pretty much the same as in the dataframe solution, but without dataframes, and is therefore faster.
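For illustration, here is a lightly condensed, self-contained sketch of that list-of-dicts cross_join on toy rows (dict.update stands in for the inner key/value loop):

```python
from copy import deepcopy

def cross_join(left, right):
    # copy every left row once per right row and merge in the right row's keys
    new_rows = [] if right else left
    for left_row in left:
        for right_row in right:
            temp_row = deepcopy(left_row)
            temp_row.update(right_row)
            new_rows.append(temp_row)
    return new_rows

ids = [{'id': 1}, {'id': 2}]
tags = [{'tag': 'a'}, {'tag': 'b'}]
rows = cross_join(ids, tags)
print(rows)  # 4 combined rows: (1,a), (1,b), (2,a), (2,b)
```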

I added the flatten_list generator to make sure the JSON arrays all come out flat, as a single list of dictionaries in which each entry already carries the heading accumulated so far before it is assigned to the list's values. This pretty much mimics the pandas.concat behaviour in this case.
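The generator can be tried on its own with an arbitrarily nested toy list:

```python
# Sketch of the flatten_list generator: recursively yields the leaves of
# nested lists, so nested JSON arrays become one flat sequence of dicts.
def flatten_list(data):
    for elem in data:
        if isinstance(elem, list):
            yield from flatten_list(elem)
        else:
            yield elem

nested = [{'a': 1}, [{'b': 2}, [{'c': 3}]]]
print(list(flatten_list(nested)))  # [{'a': 1}, {'b': 2}, {'c': 3}]
```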

The logic in the main function, json_to_dataframe, is the same as before. All that needed to change was implementing the operations the dataframes used to perform as plain functions.

Also, in the dataframes solution I was not prepending the accumulated heading to the nested object's keys, but unless you are 100% sure there are no conflicts in column names, doing so is pretty much mandatory.

I hope this helps :).

EDIT: Modified the cross_join function to handle the case where a nested list is empty, keeping the previous result set unchanged. The output does not change even after adding the empty JSON list ("something": []) to the example data. Thank you @Nazmus Sakib for pointing it out.
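A minimal self-contained check of that empty-list behaviour (cross_join restated here, in condensed form, so the snippet runs on its own):

```python
from copy import deepcopy

def cross_join(left, right):
    new_rows = [] if right else left  # empty right: pass left rows through
    for left_row in left:
        for right_row in right:
            temp_row = deepcopy(left_row)
            temp_row.update(right_row)
            new_rows.append(temp_row)
    return new_rows

kept = cross_join([{'id': 1}], [])  # e.g. flattening "something": []
print(kept)  # [{'id': 1}] -- the previous result set survives untouched
```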
