Improving performance when using jq to process large files


Problem Description

I need to split large files (~5G) of JSON data into smaller files with newline-delimited JSON in a memory efficient way (i.e., without having to read the entire JSON blob into memory). The JSON data in each source file is an array of objects.

Unfortunately, the source data is not newline-delimited JSON and in some cases there are no newlines in the files at all. This means I can't simply use the split command to split the large file into smaller chunks by newline. Here are examples of how the source data is stored in each file:

Example of a source file with newlines.

[{"id": 1, "name": "foo"}
,{"id": 2, "name": "bar"}
,{"id": 3, "name": "baz"}
...
,{"id": 9, "name": "qux"}]

Example of a source file without newlines.

[{"id": 1, "name": "foo"}, {"id": 2, "name": "bar"}, ...{"id": 9, "name": "qux"}]

Here's an example of the desired format for a single output file:

{"id": 1, "name": "foo"}
{"id": 2, "name": "bar"}
{"id": 3, "name": "baz"}

Current Solution

I'm able to achieve the desired result by using jq and split as described in this SO Post. This approach is memory efficient thanks to the jq streaming parser. Here's the command that achieves the desired result:

cat large_source_file.json \
  | jq -cn --stream 'fromstream(1|truncate_stream(inputs))' \
  | split --line-bytes=1m --numeric-suffixes - split_output_file
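For a quick sanity check of what that filter emits, here is the same streaming filter run on a tiny inline array (the sample objects are arbitrary; any jq with `--stream` support should behave the same way):

```shell
# The streaming filter reconstructs each top-level array element
# and prints it as one compact JSON object per line.
printf '[{"id": 1, "name": "foo"}, {"id": 2, "name": "bar"}]' \
  | jq -cn --stream 'fromstream(1|truncate_stream(inputs))'
# {"id":1,"name":"foo"}
# {"id":2,"name":"bar"}
```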

The Problem

The command above takes ~47 mins to process through the entire source file. This seems quite slow, especially when compared to sed which can produce the same output much faster.

Here are some performance benchmarks to show processing time with jq vs. sed.

export SOURCE_FILE=medium_source_file.json  # smaller 250MB

# using jq
time cat ${SOURCE_FILE} \
  | jq -cn --stream 'fromstream(1|truncate_stream(inputs))' \
  | split --line-bytes=1m - split_output_file

real    2m0.656s
user    1m58.265s
sys     0m6.126s

# using sed
time cat ${SOURCE_FILE} \
  | sed -E 's#^\[##g' \
  | sed -E 's#^,\{#\{#g' \
  | sed -E 's#\]$##g' \
  | sed 's#},{#}\n{#g' \
  | split --line-bytes=1m - sed_split_output_file

real    0m25.545s
user    0m5.372s
sys     0m9.072s
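To experiment with the benchmarks above, a small file in the same shape as the sources can be synthesized with plain shell. This is just a sketch: `N`, the filename, and the field values are arbitrary stand-ins for the real data, and the layout mimics the "with newlines" example.

```shell
# Synthesize a JSON array of N objects shaped like the source files
# (newline-before-comma layout). Scale N up for realistic benchmarks.
N=1000
{
  printf '['
  i=1
  while [ "$i" -le "$N" ]; do
    if [ "$i" -gt 1 ]; then printf '\n,'; fi
    printf '{"id": %d, "name": "item%d"}' "$i" "$i"
    i=$((i + 1))
  done
  printf ']\n'
} > test_source_file.json
```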

Questions

  1. Is this slower processing speed expected for jq compared to sed? It makes sense jq would be slower given it's doing a lot of validation under the hood, but 4X slower doesn't seem right.
  2. Is there anything I can do to improve the speed at which jq can process this file? I'd prefer to use jq to process files because I'm confident it could seamlessly handle other line output formats, but given I'm processing thousands of files each day, it's hard to justify the speed difference I've observed.

Answer

jq's streaming parser (the one invoked with the --stream command-line option) intentionally sacrifices speed for the sake of reduced memory requirements, as illustrated below in the metrics section. A tool which strikes a different balance (one which seems to be closer to what you're looking for) is jstream, the homepage of which is https://github.com/bcicen/jstream

Running the sequence of commands in a bash or bash-like shell:

cd
go get github.com/bcicen/jstream
cd go/src/github.com/bcicen/jstream/cmd/jstream/
go build

will result in an executable, which you can invoke like so:

jstream -d 1 < INPUTFILE > STREAM

Assuming INPUTFILE contains a (possibly ginormous) JSON array, the above will behave like jq's .[], with jq's -c (compact) command-line option. In fact, this is also the case if INPUTFILE contains a stream of JSON arrays, or a stream of JSON non-scalars ...
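Swapping jstream into the original pipeline is then a drop-in change. A sketch, assuming the jstream binary built above is on your PATH; the split stage is unchanged from the question:

```shell
# jstream -d 1 emits one compact JSON object per line, so split's
# --line-bytes option can chunk the stream exactly as before.
jstream -d 1 < large_source_file.json \
  | split --line-bytes=1m --numeric-suffixes - split_output_file
```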

For the task at hand (streaming the top-level items of an array):

                  mrss   u+s
jq --stream:      2 MB   447
jstream    :      8 MB   114
jq         :  5,582 MB    39

In other words:

  1. space: jstream is economical with memory, but not as much as jq's streaming parser.

  2. time: jstream runs slightly slower than jq's regular parser but about 4 times faster than jq's streaming parser.

Interestingly, space*time is about the same for the two streaming parsers.

The test file consists of an array of 10,000,000 simple objects:

[
{"key_one": 0.13888342355537053, "key_two": 0.4258700286271502, "key_three": 0.8010012924267487}
,{"key_one": 0.13888342355537053, "key_two": 0.4258700286271502, "key_three": 0.8010012924267487}
...
]

$ ls -l input.json
-rw-r--r--  1 xyzzy  staff  980000002 May  2  2019 input.json

$ wc -l input.json
 10000001 input.json

jq times and mrss

$ /usr/bin/time -l jq empty input.json
       43.91 real        37.36 user         4.74 sys
4981452800  maximum resident set size

$ /usr/bin/time -l jq length input.json
10000000
       48.78 real        41.78 user         4.41 sys
4730941440  maximum resident set size

$ /usr/bin/time -l jq type input.json
"array"
       37.69 real        34.26 user         3.05 sys
5582196736  maximum resident set size

$ /usr/bin/time -l jq 'def count(s): reduce s as $i (0;.+1); count(.[])' input.json
10000000
       39.40 real        35.95 user         3.01 sys
5582176256  maximum resident set size

$ /usr/bin/time -l jq -cn --stream 'fromstream(1|truncate_stream(inputs))' input.json | wc -l
      449.88 real       444.43 user         2.12 sys
   2023424  maximum resident set size
 10000000

jstream times and mrss

$ /usr/bin/time -l jstream -d 1 < input.json > /dev/null
       61.63 real        79.52 user        16.43 sys
   7999488  maximum resident set size

$ /usr/bin/time -l jstream -d 1 < input.json | wc -l
       77.65 real        93.69 user        20.85 sys
   7847936  maximum resident set size
 10000000
