Can jq perform aggregation across files

Question
I'm trying to identify a program that will allow me to efficiently take a number of large CSV files (totaling 40+ GB) and output a JSON file with the specific format I need for import into Elasticsearch (ES).
Can jq efficiently take data like this:
file1:

    id,age,gender,wave
    1,49,M,1
    2,72,F,0

file2:

    id,time,event1
    1,4/20/2095,V39
    1,4/21/2095,T21
    2,5/17/2094,V39
aggregate it by id (such that all the JSON documents from CSV rows in multiple files fall under a single id entry), outputting something like this:
    {"index":{"_index":"forum_mat","_type":"subject","_id":"1"}}
    {"id":"1","file1":[{"filen":"file1","id":"1","age":"49","gender":"M","wave":"1"}],"file2":[{"filen":"file2","id":"1","time":"4/20/2095","event1":"V39"},{"filen":"file2","id":"1","time":"4/21/2095","event1":"T21"}]}
    {"index":{"_index":"forum_mat","_type":"subject","_id":"2"}}
    {"id":"2","file1":[{"filen":"file1","id":"2","age":"72","gender":"F","wave":"0"}],"file2":[{"filen":"file2","id":"2","time":"5/17/2094","event1":"V39"}]}
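For concreteness, here is a minimal Python sketch (not jq) of the aggregation being asked for: group CSV rows from several files by id and emit Bulk-API action/document line pairs. The sample data is inlined from the question; the index and type names are taken from the desired output above.

```python
import csv
import io
import json
from collections import defaultdict

# Sample inputs matching the question's file1 and file2 (inlined for illustration).
file1 = "id,age,gender,wave\n1,49,M,1\n2,72,F,0\n"
file2 = "id,time,event1\n1,4/20/2095,V39\n1,4/21/2095,T21\n2,5/17/2094,V39\n"

# id -> {"id": ..., "file1": [...], "file2": [...]}
subjects = defaultdict(dict)
for filen, text in (("file1", file1), ("file2", file2)):
    for row in csv.DictReader(io.StringIO(text)):
        entry = subjects[row["id"]]
        entry["id"] = row["id"]
        # Each CSV row becomes a JSON document tagged with its source file.
        entry.setdefault(filen, []).append({"filen": filen, **row})

# Emit Bulk-API pairs: an action line, then the aggregated document line.
lines = []
for _id, doc in subjects.items():
    lines.append(json.dumps({"index": {"_index": "forum_mat",
                                       "_type": "subject", "_id": _id}}))
    lines.append(json.dumps(doc))
print("\n".join(lines))
```

This keeps everything in memory, so it only illustrates the shape of the problem; at 40+ GB the grouping step is exactly what needs to be done out-of-core (which is why the accepted approach below uses SQL).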
I wrote a script in Matlab, but as I feared, it is much too slow; it might take months to crunch all 40+ GB of data. I was informed that Logstash (the preferred data-input tool for ES) isn't good at this type of aggregation.
Solution

As suggested in one of the comments, I ended up using SQL to export JSON in the format I required. Another thread helped tremendously. In the end I chose to output each SQL table to its own JSON file instead of combining them (the combined file size was becoming unmanageable). This is the code structure that produces the command line for the Bulk API followed by the JSON data line:
    create or replace function format_data_line(command text, data_str text)
    returns setof text language plpgsql as $$
    begin
      return next command;
      return next
        replace(
          regexp_replace(data_str,
            '(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g'),
          e'\n', '');
    end $$;

    COPY (
      with f_1 as (
        SELECT id, json_agg(fileX.*) AS tag
        FROM forum.file3
        GROUP BY id
      )
      SELECT
        format_data_line(
          format('{"update":{"_index":"forum2","_type":"subject","_id":%s}}', a.id),
          format('{"doc":{"id":%s,"fileX":%s}}', a.id, a.tag))
      FROM f_1 a
    ) TO '/path/to/json/fileX.json';
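To make the string cleanup in format_data_line concrete, here is a hedged Python mirror of its two transformations: ISO timestamps of the form "YYYY-MM-DDT..." (as json_agg renders them) have the "T" replaced with a space, and embedded newlines are stripped so the document fits on one Bulk-API line. The sample payload is hypothetical.

```python
import re

def format_data_line(command: str, data_str: str) -> list[str]:
    # Mirror of the plpgsql function: first the Bulk-API command line,
    # then the data line with "YYYY-MM-DDT" turned into "YYYY-MM-DD "
    # and any embedded newlines removed.
    cleaned = re.sub(r"(\d\d\d\d-\d\d-\d\d)T", r"\1 ", data_str)
    cleaned = cleaned.replace("\n", "")
    return [command, cleaned]

# Hypothetical row, shaped like the SQL above would produce it.
pair = format_data_line(
    '{"update":{"_index":"forum2","_type":"subject","_id":1}}',
    '{"doc":{"id":1,"fileX":[{"time":"2095-04-20T00:00:00"}]}}',
)
print("\n".join(pair))
```

Each call yields one action line and one document line, which is exactly the pairing the Bulk API expects in the exported file.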
Importing the larger files with the Bulk API also turned out to be problematic (out-of-memory Java errors), so a script was needed to send only subsets of the data to curl (for indexing in Elasticsearch) at a given time. The basic structure of that script is:
    #!/bin/bash
    FILE=$1
    INC=100
    numline=`wc -l $FILE | awk '{print $1}'`
    rm -f output/$FILE.txt
    for i in `seq 1 $INC $numline`; do
      TIME=`date +%H:%M:%S`
      echo "[$TIME] Processing lines from $i to $((i + INC - 1))"
      rm -f intermediates/interm_file_$i.json
      sed -n $i,$((i + INC - 1))p $FILE >> intermediates/interm_file_$i.json
      curl -s -XPOST localhost:9200/_bulk --data-binary @intermediates/interm_file_$i.json >> output/$FILE.txt
    done
"intermediates" and "output" directories should be created beneath the script file's directory. The script can be saved as "ESscript" and run on the command line with:
./ESscript fileX.json