Convert nested JSON to CSV file in Python


Problem description

I know this question has been asked many times. I tried several solutions, but none of them solved my problem.

I have a large nested JSON file (1.4GB) that I would like to flatten and then convert to a CSV file.

The JSON structure is as follows:

{
  "company_number": "12345678",
  "data": {
    "address": {
      "address_line_1": "Address 1",
      "locality": "Henley-On-Thames",
      "postal_code": "RG9 1DP",
      "premises": "161",
      "region": "Oxfordshire"
    },
    "country_of_residence": "England",
    "date_of_birth": {
      "month": 2,
      "year": 1977
    },
    "etag": "26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00",
    "kind": "individual-person-with-significant-control",
    "links": {
      "self": "/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl"
    },
    "name": "John M Smith",
    "name_elements": {
      "forename": "John",
      "middle_name": "M",
      "surname": "Smith",
      "title": "Mrs"
    },
    "nationality": "Vietnamese",
    "natures_of_control": [
      "ownership-of-shares-50-to-75-percent"
    ],
    "notified_on": "2016-04-06"
  }
}

I know that this is easy to accomplish with the pandas module, but I am not familiar with it.

Edited

The desired output should be something like this:

company_number, address_line_1, locality, country_of_residence, kind
12345678, Address 1, Henley-On-Thames, England, individual-person-with-significant-control

Note that this is just a shortened version. The output should contain all the fields.

Recommended answer

For the JSON data you have given, you could do this by walking the JSON structure and returning a flat list of all the leaf nodes.

This assumes that your structure is consistent throughout. If each entry can have different fields, see the second approach.

For example:

import json
import csv

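def get_leaves(item, key=None):
    """Return a flat list of (key, value) pairs for every leaf in item."""
    if isinstance(item, dict):
        # Recurse into each value, passing its key down as the column name.
        leaves = []
        for i in item.keys():
            leaves.extend(get_leaves(item[i], i))
        return leaves
    elif isinstance(item, list):
        # List elements share the key of the list that contains them.
        leaves = []
        for i in item:
            leaves.extend(get_leaves(i, key))
        return leaves
    else:
        # Anything else is a leaf value.
        return [(key, item)]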


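with open('json.txt') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    write_header = True

    # json.load() parses the whole file at once and expects a list of entries.
    for entry in json.load(f_input):
        # Sort so that every row lists its columns in the same order.
        leaf_entries = sorted(get_leaves(entry))

        # Write the header once, using the keys of the first entry.
        if write_header:
            csv_output.writerow([k for k, v in leaf_entries])
            write_header = False

        csv_output.writerow([v for k, v in leaf_entries])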

If your JSON data is a list of entries in the format you have given, then you should get output as follows:

address_line_1,company_number,country_of_residence,etag,forename,kind,locality,middle_name,month,name,nationality,natures_of_control,notified_on,postal_code,premises,region,self,surname,title,year
Address 1,12345678,England,26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00,John,individual-person-with-significant-control,Henley-On-Thames,M,2,John M Smith,Vietnamese,ownership-of-shares-50-to-75-percent,2016-04-06,RG9 1DP,161,Oxfordshire,/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl,Smith,Mrs,1977
Address 1,12345679,England,26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00,John,individual-person-with-significant-control,Henley-On-Thames,M,2,John M Smith,Vietnamese,ownership-of-shares-50-to-75-percent,2016-04-06,RG9 1DP,161,Oxfordshire,/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl,Smith,Mrs,1977
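Because the approach above names each column after the leaf key alone, two leaves with the same name in different sub-objects would collide, and a list with several values would produce several columns with the same name. The following is a minimal sketch of one way to avoid that, not part of the original answer: it builds dotted key paths and joins list values into a single cell. The function name `flatten`, the "; " separator, the trimmed sample entry and the second `natures_of_control` value are all assumptions made for illustration, and the sketch assumes lists contain only scalar values.

```python
import csv
import io


def flatten(item, prefix=""):
    """Return a dict mapping dotted key paths to leaf values."""
    if isinstance(item, dict):
        flat = {}
        for key, value in item.items():
            path = f"{prefix}.{key}" if prefix else key
            flat.update(flatten(value, path))
        return flat
    if isinstance(item, list):
        # Assumption: lists hold only scalars, so they fit into one cell.
        return {prefix: "; ".join(str(v) for v in item)}
    return {prefix: item}


# Trimmed version of the entry from the question; the second list value
# is added here only to show how lists are joined.
entry = {
    "company_number": "12345678",
    "data": {
        "address": {"locality": "Henley-On-Thames", "region": "Oxfordshire"},
        "date_of_birth": {"month": 2, "year": 1977},
        "natures_of_control": [
            "ownership-of-shares-50-to-75-percent",
            "voting-rights-50-to-75-percent",
        ],
    },
}

row = flatten(entry)

# Write to an in-memory buffer so the sketch does not touch the disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=sorted(row))
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
```

The header then contains names such as data.address.locality and data.date_of_birth.year, which stay unique even if two sub-objects share a leaf name.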


If each entry can contain different (or possibly missing) fields, then a better approach would be to use a DictWriter. In this case, all of the entries need to be processed first to determine the complete list of possible fieldnames, so that the correct header can be written.

import json
import csv

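def get_leaves(item, key=None):
    """Return a dict mapping each leaf key in item to its value."""
    if isinstance(item, dict):
        # Merge the leaves of every value into one flat dict.
        leaves = {}
        for i in item.keys():
            leaves.update(get_leaves(item[i], i))
        return leaves
    elif isinstance(item, list):
        # List elements share the list's key, so a later element replaces an earlier one.
        leaves = {}
        for i in item:
            leaves.update(get_leaves(i, key))
        return leaves
    else:
        # Anything else is a leaf value.
        return {key: item}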


with open('json.txt') as f_input:
    json_data = json.load(f_input)

# First parse all entries to get the complete fieldname list
fieldnames = set()

for entry in json_data:
    fieldnames.update(get_leaves(entry).keys())

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=sorted(fieldnames))
    csv_output.writeheader()
    csv_output.writerows(get_leaves(entry) for entry in json_data)
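Both versions above call json.load(), which reads the whole file into memory, and that may be a problem for a 1.4GB input. The following is a hedged sketch of a two-pass streaming variant, not part of the original answer. It assumes the input is stored as JSON Lines (one complete JSON object per line), which the question does not confirm. The names `iter_entries` and `jsonl_to_csv` are hypothetical helpers introduced here.

```python
import csv
import json


def get_leaves(item, key=None):
    """Same idea as the dict-based get_leaves above: leaf name -> value."""
    if isinstance(item, dict):
        leaves = {}
        for name, value in item.items():
            leaves.update(get_leaves(value, name))
        return leaves
    if isinstance(item, list):
        leaves = {}
        for value in item:
            leaves.update(get_leaves(value, key))
        return leaves
    return {key: item}


def iter_entries(path):
    """Yield one parsed entry per non-empty line (assumes JSON Lines input)."""
    with open(path, encoding="utf-8") as f_input:
        for line in f_input:
            if line.strip():
                yield json.loads(line)


def jsonl_to_csv(json_path, csv_path):
    # Pass 1: collect every field name without keeping the entries in memory.
    fieldnames = set()
    for entry in iter_entries(json_path):
        fieldnames.update(get_leaves(entry))

    # Pass 2: re-read the file and write each row as soon as it is parsed.
    with open(csv_path, "w", newline="", encoding="utf-8") as f_output:
        writer = csv.DictWriter(f_output, fieldnames=sorted(fieldnames), restval="")
        writer.writeheader()
        for entry in iter_entries(json_path):
            # Fields missing from an entry are filled with restval.
            writer.writerow(get_leaves(entry))
```

It could be called as jsonl_to_csv('json.txt', 'output.csv'). Reading the file twice trades extra I/O for holding only one entry in memory at a time. If the file is instead a single large JSON array, this sketch does not apply as written.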
