Using jq, how can I split a JSON stream of objects into separate files based on the values of an object property?


Problem description

I have a very large file (20GB+ compressed) called input.json containing a stream of JSON objects as follows:

{
    "timestamp": "12345",
    "name": "Some name",
    "type": "typea"
}
{
    "timestamp": "12345",
    "name": "Some name",
    "type": "typea"
}
{
    "timestamp": "12345",
    "name": "Some name",
    "type": "typeb"
}

I want to split this file into separate files according to each object's type property: typea.json, typeb.json, etc., each containing its own stream of JSON objects that all share the matching type property.

I've managed to solve this problem for smaller files, however with such a large file I run out of memory on my AWS instance. As I wish to keep memory usage down, I understand I need to use --stream but I'm struggling to see how I can achieve this.

cat input.json | jq -c --stream 'select(.[0][0]=="type") | .[1]' will return me the values of each of the type properties, but how do I use this to then filter the objects?

Any help would be greatly appreciated!

Recommended answer

Assuming the JSON objects in the file are relatively small (none more than a few MB), you won't need to use the (rather complex) "--stream" command-line option, which is mainly needed when the input is (or includes) a single humungous JSON entity.

There are however several choices still to be made. The main ones are described at Split a JSON file into separate files, these being a multi-pass approach (N or (N+1) calls to jq, where N is the number of output files), and an approach that involves just one call to jq, followed by a call to a program such as awk to perform the actual partitioning into files. Each approach has its pros and cons, but if reading the input file N times is acceptable, then the first approach might be better.
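As a rough illustration of the two approaches, here is a minimal sketch run against a tiny stand-in file. The names sample.json, multipass/ and singlepass/ are assumptions made for this example, and it assumes every type value is safe to use as a file name and contains no tab character.

```shell
#!/bin/sh
# Sketch only: sample.json and the output directories are hypothetical names.
set -eu

# A small stand-in for the real input: a stream of JSON objects.
cat > sample.json <<'EOF'
{"timestamp": "12345", "name": "Some name", "type": "typea"}
{"timestamp": "12345", "name": "Some name", "type": "typea"}
{"timestamp": "12345", "name": "Some name", "type": "typeb"}
EOF

# Approach 1: multi-pass (N+1 calls to jq).
# One pass lists the distinct types, then one pass per type filters the objects.
mkdir -p multipass
jq -r '.type' sample.json | sort -u | while IFS= read -r t; do
    jq -c --arg t "$t" 'select(.type == $t)' sample.json > "multipass/$t.json"
done

# Approach 2: single pass (one call to jq, one call to awk).
# jq prints "<type><TAB><compact object>" per line; awk routes each line
# to the file named after the first field.
mkdir -p singlepass
jq -r '"\(.type)\t\(tojson)"' sample.json |
awk -F'\t' '{
    out = "singlepass/" $1 ".json"
    # Strip the type and the tab, keeping only the JSON text.
    print substr($0, length($1) + 2) > out
}'
```

Both variants handle one object at a time, so memory use depends on the size of the largest object rather than on the size of the file. With a very large number of distinct types, awk may hit its limit on simultaneously open files, in which case the output files would need to be closed and reopened in append mode.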

To estimate the total computational resources that will be required, it would probably be a good idea to measure the resources used by running jq empty input.json
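One possible way to take that measurement is sketched below. The file sample.json is a hypothetical stand-in for the real input, and wall-clock timing with date is only a coarse proxy for the resources used.

```shell
#!/bin/sh
# Sketch only: sample.json is a hypothetical stand-in for input.json.
printf '%s\n' '{"type":"typea"}' '{"type":"typeb"}' > sample.json

# "jq empty" parses every input object but prints nothing, so it gives a
# baseline cost for simply reading the file with jq.
start=$(date +%s)
jq empty sample.json
status=$?
end=$(date +%s)

echo "jq empty: exit status $status, elapsed $((end - start))s"

# Where GNU time is installed, "/usr/bin/time -v jq empty sample.json"
# also reports peak memory ("Maximum resident set size").
```

A non-zero exit status here would indicate that jq could not parse part of the input, which is worth knowing before starting a long split.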

(From your brief writeup, it sounds like the memory issue you've run into results primarily from the unzipping of the file.)
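If decompression is indeed the bottleneck, one option is to decompress into a pipe so that the uncompressed data is never held in full. The sketch below assumes gzip compression, which the question does not state, and uses the hypothetical file names sample.json.gz and typea_only.json.

```shell
#!/bin/sh
# Sketch only: gzip and the file names are assumptions for this example.
set -eu

# Build a small compressed stand-in for the real input.
printf '%s\n' \
    '{"timestamp":"12345","name":"Some name","type":"typea"}' \
    '{"timestamp":"12345","name":"Some name","type":"typea"}' \
    '{"timestamp":"12345","name":"Some name","type":"typeb"}' |
gzip -c > sample.json.gz

# gzip -dc writes the decompressed data to stdout; jq reads it from stdin
# and processes the objects one at a time as they arrive.
gzip -dc sample.json.gz | jq -c 'select(.type == "typea")' > typea_only.json
```

The same pipe can feed the single-pass jq and awk combination, so the compressed file is read only once.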
