Load a large JSON file using multi-threading

Problem description

I am trying to load a large 3 GB JSON file. Currently, with the JQ utility, I can load the entire file in nearly 40 minutes. Now, I want to know how I can use a parallel/multi-threading approach in JQ in order to complete the process in less time. I am using v1.5.

Command used:

JQ.exe -r -s "map(.\"results\" | map({\"ID\": (((.\"body\"?.\"party\"?.\"xrefs\"?.\"xref\"//[] | map(select(ID))[]?.\"id\"?))//null), \"Name\": (((.\"body\"?.\"party\"?.\"general-info\"?.\"full-name\"?))//null)} | [(.\"ID\"//\"\"|tostring), (.\"Name\"//\"\"|tostring)])) | add[] | join(\"~\")" "\C:\InputFile.txt" >"\C:\OutputFile.txt"
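
For readability, here is the same filter with the shell escaping removed; it could be saved to a file (the name myfilter.jq below is used only for illustration) and run with jq -r -s -f myfilter.jq instead of being passed inline:

map(."results"
    | map({"ID":   (((."body"?."party"?."xrefs"?."xref" // []
                      | map(select(ID))[]?."id"?)) // null),
           "Name": (((."body"?."party"?."general-info"?."full-name"?)) // null)}
          | [(."ID" // "" | tostring), (."Name" // "" | tostring)]))
| add[]
| join("~")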

My data:

{
  "results": [
    {
      "_id": "0000001",
      "body": {
        "party": {
          "related-parties": {},
          "general-info": {
            "last-update-ts": "2011-02-14T08:21:51.000-05:00",
            "full-name": "Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades",
            "status": "ACTIVE",
            "last-update-user": "TS42922",
            "create-date": "2011-02-14T08:21:51.000-05:00",
            "classifications": {
              "classification": [
                {
                  "code": "PENS"
                }
              ]
            }
          },
          "xrefs": {
            "xref": [
              {
                "type": "LOCCU1",
                "id": "X00893X"
              },
              {
                "type": "ID",
                "id": "1012227139"
              }
            ]
          }
        }
      }
    },
    {
      "_id": "000002",
      "body": {
        "party": {
          "related-parties": {},
          "general-info": {
            "last-update-ts": "2015-05-21T15:10:45.174-04:00",
            "full-name": "Innova Capital Sp zoo",
            "status": "ACTIVE",
            "last-update-user": "jw74592",
            "create-date": "1994-08-31T00:00:00.000-04:00",
            "classifications": {
              "classification": [
                {
                  "code": "CORP"
                }
              ]
            }
          },
          "xrefs": {
            "xref": [
              {
                "type": "ULTDUN",
                "id": "144349875"
              },
              {
                "type": "AVID",
                "id": "6098743"
              },
              {
                "type": "LOCCU1",
                "id": "1001210218"
              },
              {
                "type": "ID",
                "id": "1001210218"
              },
              {
                "type": "BLMBRG",
                "id": "10009050"
              },
              {
                "type": "REG_CO",
                "id": "0000068508"
              },
              {
                "type": "SMCI",
                "id": "13159"
              }
            ]
          }
        }
      }
    }
  ]
}

Can someone please help me with which command I need to use in v1.5 in order to achieve parallelism/multithreading?

Answer

Here is a streaming approach which assumes your 3GB data file is in data.json and the following filter is in filter1.jq:

  # keep only the [path, value] leaf events emitted by --stream
  select(length==2)
| . as [$p, $v]
  # r is the index of the record within the top-level "results" array
| {r:$p[1]}
| if   $p[2:6] == ["body","party","general-info","full-name"]       then .name = $v
  elif $p[2:6] == ["body","party","xrefs","xref"] and $p[7] == "id" then .id   = $v
  else  empty
  end

When you run jq with

$ jq -M -c --stream -f filter1.jq data.json

jq will produce a stream of results with the minimal details you need:

{"r":0,"name":"Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades"}
{"r":0,"id":"X00893X"}
{"r":0,"id":"1012227139"}
{"r":1,"name":"Innova Capital Sp zoo"}
{"r":1,"id":"144349875"}
{"r":1,"id":"6098743"}
{"r":1,"id":"1001210218"}
{"r":1,"id":"1001210218"}
{"r":1,"id":"10009050"}
{"r":1,"id":"0000068508"}
{"r":1,"id":"13159"}

which you can convert to your desired format using a second filter, filter2.jq:

foreach .[] as $i (
     {c: null, r:null, id:null, name:null}        # state: current item, record number, latest id and name

   ; .c = $i
   | if .r != .c.r then .id=null | .name=null | .r=.c.r else . end   # control break: new record, reset id and name
   | .id   = if .c.id == null   then .id   else .c.id   end          # pick up an id when the item carries one
   | .name = if .c.name == null then .name else .c.name end          # pick up the name when the item carries one

   ; [.id, .name]
   | if contains([null]) then empty else . end                       # emit only once both id and name are known
   | join("~")
)

Running

$ jq -M -c --stream -f filter1.jq data.json | jq -M -s -r -f filter2.jq

produces

X00893X~Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades
1012227139~Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades
144349875~Innova Capital Sp zoo
6098743~Innova Capital Sp zoo
1001210218~Innova Capital Sp zoo
1001210218~Innova Capital Sp zoo
10009050~Innova Capital Sp zoo
0000068508~Innova Capital Sp zoo
13159~Innova Capital Sp zoo

This might be all you need using just two jq processes. If you need more parallelism, you could use the record number (r) to partition the data and process the partitions in parallel. For example, if you save the intermediate output into a temp.json file

$ jq -M -c --stream -f filter1.jq data.json > temp.json

then you could process temp.json in parallel with filters such as

$ jq -M 'select(0==.r%3)' temp.json | jq -M -s -r -f filter2.jq > result0.out &
$ jq -M 'select(1==.r%3)' temp.json | jq -M -s -r -f filter2.jq > result1.out &
$ jq -M 'select(2==.r%3)' temp.json | jq -M -s -r -f filter2.jq > result2.out &

and concatenate your partitions into a single result at the end if necessary. This example uses 3 partitions, but you could easily extend this approach to any number of partitions if you need more parallelism.
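
For example, a minimal sketch of that final step (assuming the three background jobs above, file names as used above, and that the row order across partitions does not matter):

$ wait
$ cat result0.out result1.out result2.out > result.out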

GNU parallel is also a good option. As mentioned in the JQ Cookbook, jq-hopkok's parallelism folder has some good examples.
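
A rough sketch of that idea (illustrative only: the --block size is arbitrary, and since --pipe splits the stream on line boundaries a record's lines may be divided across two chunks, so rows near chunk boundaries would need checking):

$ jq -M -c --stream -f filter1.jq data.json | parallel --keep-order --pipe --block 10M "jq -M -s -r -f filter2.jq" > result.out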
