Can jq perform aggregation across files


Question

I'm trying to identify a program or tool that will allow me to efficiently take a number of large CSV files (totaling 40+ GB) and output a JSON file in the specific format I need for import into Elasticsearch (ES).

Can jq efficiently take data like this:

file1:
id,age,gender,wave
1,49,M,1
2,72,F,0

file2:
id,time,event1
1,4/20/2095,V39
1,4/21/2095,T21
2,5/17/2094,V39

aggregate it by id (such that all the JSON documents from CSV rows in multiple files fall under a single id entry), outputting something like this:

{"index":{"_index":"forum_mat","_type":"subject","_id":"1"}}
{"id":"1","file1":[{"filen":"file1","id":"1","age":"49","gender":"M","wave":"1"}],"file2":[{"filen":"file2","id":"1","time":"4/20/2095","event1":"V39"},{"filen":"file2","id":"1","time":"4/21/2095","event1":"T21"}]}
{"index":{"_index":"forum_mat","_type":"subject","_id":"2"}}
{"id":"2","file1":[{"filen":"file1","id":"2","age":"72","gender":"F","wave":"0"}],"file2":[{"filen":"file2","id":"2","time":"5/17/2094","event1":"V39"}]}

I wrote a script in Matlab, but as I feared it is much too slow; it might take months to crunch all 40+ GB of data. I was informed that Logstash (which is the preferred data-input tool for ES) isn't good at this type of aggregation.

Solution

As suggested in one of the comments, I ended up using SQL to export JSON in the format I required. Another thread helped tremendously. In the end I chose to output each SQL table to its own JSON file instead of combining them (the combined file size was becoming unmanageable). This is the code structure that produces both the Bulk API action line and the JSON data line for each record:

-- Emits two lines per record: the Bulk API action line ("command") followed
-- by the JSON document ("data_str"), cleaned up so it fits on a single line.
create or replace function format_data_line(command text, data_str text)
returns setof text language plpgsql as $$
begin
    return next command;
    return next
        replace(
            regexp_replace(data_str,
                '(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g'),  -- "YYYY-MM-DDThh:mm:ss" -> "YYYY-MM-DD hh:mm:ss"
            e' \n ', '');                              -- strip embedded newlines
end $$;

COPY (
    with f_1 as (
        -- one JSON array per id holding all of that id's rows
        SELECT id, json_agg(fileX.*) AS tag
        FROM forum.fileX
        GROUP BY id
    )
    SELECT
        format_data_line(
            -- Bulk API action line
            format('{"update":{"_index":"forum2","_type":"subject","_id":%s}}', a.id),
            -- partial document carrying this file's rows for the id
            format('{"doc":{"id":%s,"fileX":%s}}',
                a.id, a.tag))
    FROM f_1 a
) TO '/path/to/json/fileX.json';
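
To run the export, the function definition and the COPY statement can be placed in a SQL file and executed with psql. A minimal sketch, assuming the SQL above is saved as export_fileX.sql and the data lives in a database called forum_db (both names are placeholders); note that COPY ... TO writes the output file on the database server, so the path must be writable by the server process:

psql -d forum_db -f export_fileX.sql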

Importing the larger files with the Bulk API also turned out to be problematic (out-of-memory Java errors), so a script was needed to send only a subset of the data to curl (for indexing in Elasticsearch) at a time. The basic structure of that script is:

#!/bin/bash
# Send a bulk-format JSON file to Elasticsearch in chunks of $INC lines to
# avoid out-of-memory errors. Keep INC even so that Bulk API action and
# document line pairs are never split across chunks.

FILE=$1
INC=100
numline=$(wc -l "$FILE" | awk '{print $1}')
rm -f "output/$FILE.txt"
for i in $(seq 1 $INC "$numline"); do
    TIME=$(date +%H:%M:%S)
    echo "[$TIME] Processing lines from $i to $((i + INC - 1))"
    rm -f "intermediates/interm_file_$i.json"
    # extract the next chunk of lines into its own file
    sed -n "$i,$((i + INC - 1))p" "$FILE" >> "intermediates/interm_file_$i.json"
    # index the chunk and append the bulk response to the log
    curl -s -XPOST localhost:9200/_bulk --data-binary "@intermediates/interm_file_$i.json" >> "output/$FILE.txt"
done

An "intermediates" directory should be created beneath the script files directory. The script can be saved as "ESscript" and run on the command line with:

./ESscript fileX.json
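
Each curl response is appended to output/fileX.json.txt, so a quick sanity check (an optional addition, not part of the original workflow) is to count how many bulk responses reported failures; a count of zero means no chunk returned errors:

grep -c '"errors":true' output/fileX.json.txt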
