将一个较大的json文件拆分为多个较小的文件 [英] Split a large json file into multiple smaller files

查看:228
本文介绍了将一个较大的json文件拆分为多个较小的文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个大的JSON文件,大约500万条记录和大约32GB的文件大小,需要将它们加载到我们的Snowflake数据仓库中.我需要将此文件分解为每个文件约200k条记录(约1.25GB)的块.我想在Node.JS或Python中执行此操作,以部署到AWS Lambda函数,不幸的是,我还没有编写任何代码.我有C#和很多SQL的经验,并且学习node和python都在我的工作清单中,所以为什么不直接潜水呢?!

I have a large JSON file, about 5 million records and a file size of about 32GB, that I need to get loaded into our Snowflake Data Warehouse. I need to get this file broken up into chunks of about 200k records (about 1.25GB) per file. I'd like to do this in either Node.JS or Python for deployment to an AWS Lambda function, unfortunately I haven't coded in either, yet. I have C# and a lot of SQL experience, and learning both node and python are on my to do list, so why not dive right in, right!?

我的第一个问题是哪种语言能更好地实现此功能?Python还是Node.JS?"

My first question is "Which language would better serve this function? Python, or Node.JS?"

我知道我不想将整个JSON文件读入内存(甚至不输出 small 文件).我需要能够基于记录计数(200k)将它流"入到新文件中,正确关闭json对象,并继续进入另一个200k的新文件中,等等.我知道Node可以做到这一点,但是如果Python也可以做到这一点,我觉得快速开始将要用于我将要做的其他ETL事情变得更加容易.

I know I don't want to read this entire JSON file into memory (or even the output smaller file). I need to be able to "stream" it in and out into the new file based on a record count (200k), properly close up the json objects, and continue into a new file for another 200k, and so on. I know Node can do this, but if Python can also do this, I feel like it would be easier to quickly start using for other ETL stuff I'll be doing soon.

我的第二个问题是根据上面的建议,您还可以建议我需要/导入哪些模块以帮助我入门吗?主要是因为它与不将整个json文件拉入内存有关?也许是一些技巧,窍门,或您会怎么做?如果您真的很慷慨,则可以使用一些代码示例来帮助我深入探讨此问题?

My second question is "Based on your recommendation above, can you also recommend what modules I should require/import to help me get started? Primarily as it relates to not pulling the entire json file into memory? Maybe some tips, tricks, or 'How would you do it's? And if you're feeling really generous, some code example to help push me into the deep end on this?

我无法包含JSON数据样本,因为它包含个人信息.但是我可以提供JSON模式...

I can't include a sample of the JSON data, as it contains personal information. But I can provide the JSON schema ...

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "items": {
    "properties": {
      "activities": {
        "properties": {
          "activity_id": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "frontlineorg_id": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "import_id": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "is_source": {
            "items": {
              "type": "boolean"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "address": {
        "properties": {
          "city": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "congress_dist_name": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "congress_dist_number": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "congress_end_yr": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "congress_number": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "congress_start_yr": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "county": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "formatted": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "latitude": {
            "items": {
              "type": "number"
            },
            "type": "array"
          },
          "longitude": {
            "items": {
              "type": "number"
            },
            "type": "array"
          },
          "number": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "observes_dst": {
            "items": {
              "type": "boolean"
            },
            "type": "array"
          },
          "post_directional": {
            "items": {
              "type": "null"
            },
            "type": "array"
          },
          "pre_directional": {
            "items": {
              "type": "null"
            },
            "type": "array"
          },
          "school_district": {
            "items": {
              "properties": {
                "school_dist_name": {
                  "items": {
                    "type": "string"
                  },
                  "type": "array"
                },
                "school_dist_type": {
                  "items": {
                    "type": "string"
                  },
                  "type": "array"
                },
                "school_grade_high": {
                  "items": {
                    "type": "string"
                  },
                  "type": "array"
                },
                "school_grade_low": {
                  "items": {
                    "type": "string"
                  },
                  "type": "array"
                },
                "school_lea_code": {
                  "items": {
                    "type": "integer"
                  },
                  "type": "array"
                }
              },
              "type": "object"
            },
            "type": "array"
          },
          "secondary_number": {
            "items": {
              "type": "null"
            },
            "type": "array"
          },
          "secondary_unit": {
            "items": {
              "type": "null"
            },
            "type": "array"
          },
          "state": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "state_house_dist_name": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "state_house_dist_number": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "state_senate_dist_name": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "state_senate_dist_number": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "street": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suffix": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "timezone": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "utc_offset": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "zip": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "age": {
        "type": "integer"
      },
      "anniversary": {
        "properties": {
          "date": {
            "type": "null"
          },
          "insert_datetime_utc": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "baptism": {
        "properties": {
          "church_id": {
            "type": "null"
          },
          "date": {
            "type": "null"
          },
          "insert_datetime_utc": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "birth_dd": {
        "type": "integer"
      },
      "birth_mm": {
        "type": "integer"
      },
      "birth_yyyy": {
        "type": "integer"
      },
      "church_attendance": {
        "properties": {
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "likelihood": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "cohabiting": {
        "properties": {
          "confidence": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "likelihood": {
            "items": {
              "type": "null"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "dating": {
        "properties": {
          "bool": {
            "type": "null"
          },
          "insert_datetime_utc": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "divorced": {
        "properties": {
          "bool": {
            "items": {
              "type": "null"
            },
            "type": "array"
          },
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "likelihood_considering": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "education": {
        "properties": {
          "est_level": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "email": {
        "properties": {
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "is_work_school": {
            "items": {
              "type": "boolean"
            },
            "type": "array"
          },
          "string": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "engaged": {
        "properties": {
          "insert_datetime_utc": {
            "type": "null"
          },
          "likelihood": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "est_income": {
        "properties": {
          "est_level": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "ethnicity": {
        "type": "string"
      },
      "first_name": {
        "type": "string"
      },
      "formatted_birthdate": {
        "type": "string"
      },
      "gender": {
        "type": "string"
      },
      "head_of_household": {
        "properties": {
          "bool": {
            "type": "null"
          },
          "insert_datetime_utc": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "home_church": {
        "properties": {
          "church_id": {
            "type": "null"
          },
          "group_participant": {
            "type": "null"
          },
          "insert_datetime_utc": {
            "type": "null"
          },
          "is_coaching": {
            "type": "null"
          },
          "is_giving": {
            "type": "null"
          },
          "is_serving": {
            "type": "null"
          },
          "membership_date": {
            "type": "null"
          },
          "regular_attendee": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "hub_poid": {
        "type": "integer"
      },
      "insert_datetime_utc": {
        "type": "string"
      },
      "ip_address": {
        "properties": {
          "insert_datetime_utc": {
            "type": "null"
          },
          "string": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "last_name": {
        "type": "string"
      },
      "marriage_segment": {
        "properties": {
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "string": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "married": {
        "properties": {
          "bool": {
            "items": {
              "type": "boolean"
            },
            "type": "array"
          },
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "middle_name": {
        "type": "string"
      },
      "miscellaneous": {
        "properties": {
          "attribute": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "value": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "name_suffix": {
        "type": "null"
      },
      "name_title": {
        "type": "null"
      },
      "newlywed": {
        "properties": {
          "bool": {
            "type": "null"
          },
          "insert_datetime_utc": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "parent": {
        "properties": {
          "bool": {
            "items": {
              "type": "boolean"
            },
            "type": "array"
          },
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "likelihood_expecting": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "person_id": {
        "type": "integer"
      },
      "phone": {
        "properties": {
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "number": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "type": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "property_rights": {
        "properties": {
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "string": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "psychographic_cluster": {
        "properties": {
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "string": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "religion": {
        "properties": {
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "string": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "religious_segment": {
        "properties": {
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "string": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "separated": {
        "properties": {
          "bool": {
            "type": "null"
          },
          "insert_datetime_utc": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "significant_other": {
        "properties": {
          "first_name": {
            "type": "null"
          },
          "insert_datetime_utc": {
            "type": "null"
          },
          "last_name": {
            "type": "null"
          },
          "middle_name": {
            "type": "null"
          },
          "name_suffix": {
            "type": "null"
          },
          "name_title": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "suppressed_datetime_utc": {
        "type": "string"
      },
      "target_group": {
        "properties": {
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "string": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      }
    },
    "type": "object"
  },
  "type": "array"
}

推荐答案

回答关于Python还是Node更好地完成任务的问题是一种意见,我们不允许在Stack Overflow上发表意见.您必须自己决定要拥有什么经验,以及要使用什么东西-Python或Node.

Answering the question whether Python or Node will be better for the task would be an opinion and we are not allowed to voice our opinions on Stack Overflow. You have to decide yourself what you have more experience in and what you want to work with - Python or Node.

如果您使用Node,那么有一些模块可以帮助您完成该任务,并且可以进行流JSON解析.例如.这些模块:

If you go with Node, there are some modules that can help you with that task, that do streaming JSON parsing. E.g. those modules:

  • https://www.npmjs.com/package/JSONStream
  • https://www.npmjs.com/package/stream-json
  • https://www.npmjs.com/package/json-stream

如果您使用Python,那么这里也有流式JSON解析器:

If you go with Python, there are streaming JSON parsers here as well:

  • https://github.com/kashifrazzaqui/json-streamer
  • https://github.com/danielyule/naya
  • http://www.enricozini.org/blog/2011/tips/python-stream-json/

这篇关于将一个较大的json文件拆分为多个较小的文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆