Formatting Postgres JSON output for nested Elasticsearch structure


Problem description


I've come to realize that using a SQL database (Postgres) is one of the most efficient ways to port my relational data (40+ GB across 24 CSV files) into Elasticsearch with a nested structure. However, I'm still having a couple of issues with the formatting of my JSON output from Postgres: 1) undesired line feeds (\n), 2) an undesired header line, and 3) an undesired date format. Here is a basic example to demonstrate:

file1
id,age,gender,wave
1,49,M,1
2,72,F,0

file2
id,time,event1
1,2095-04-20 12:28:55,V39
1,2095-04-21 2:27:45,T21
2,2094-05-17 18:17:25,V39

file3
id,time,event2
1,2095-04-22 3:48:53,P90
2,2094-05-18 1:28:23,RT4
2,2094-05-18 4:23:53,W3

After adding these CSVs to a schema named forum and running this SQL code:

with f_1 as(
   SELECT id, json_agg(file1.*) AS tag
   FROM forum.file1
   GROUP BY id
), f_2 as (
   SELECT id, json_agg(file2.*) AS tag
   FROM forum.file2
   GROUP BY id
), f_3 as (
   SELECT id, json_agg(file3.*) AS tag
   FROM forum.file3
   GROUP BY id
)
SELECT ('{"id":' || a.id), ('"file1":' || a.tag), ('"file2":' || b.tag), ('"file3":' || c.tag ||'}') 
FROM f_1 AS a, f_2 AS b, f_3 AS c
WHERE b.id = a.id AND c.id = a.id;

I get this output (pgAdminIII - Export to file - no quoting):

?column?,?column?,?column?,?column?
{"id":1,"file1":[{"id":1,"age":49,"gender":"M","wave":1}],"file2":[{"id":1,"time":"2095-04-20T12:28:55","event1":"V39"}, 
 {"id":1,"time":"2095-04-21T02:27:45","event1":"T21"}],"file3":[{"id":1,"time":"2095-04-22T03:48:53","event2":"P90"}]}
{"id":2,"file1":[{"id":2,"age":72,"gender":"F","wave":0}],"file2":[{"id":2,"time":"2094-05-17T18:17:25","event1":"V39"}],"file3":[{"id":2,"time":"2094-05-18T01:28:23","event2":"RT4"}, 
 {"id":2,"time":"2094-05-18T04:23:53","event2":"W3"}]}

You can see that for a given id the data spans multiple lines. I need all of the data for a given id to be on one line (i.e. no \n's). A couple of other minor issues that I haven't spent much time on but would like to change: the first row isn't needed, and I'd like to get rid of the ?column?,?column?,?column?,?column? header without having to open the file after processing is done. Ideally I'd also prefer that there were no T in the date output. I should be able to accommodate the T in Elasticsearch, but thus far I haven't gotten it to accept it. This is the output I desire from Postgres, which works as input into Elasticsearch (using stream2es and a nested mapping structure):

{"id":1,"file1":[{"id":1,"age":49,"gender":"M","wave":1}],"file2":[{"id":1,"time":"2095-04-20 12:28:55","event1":"V39"},{"id":1,"time":"2095-04-21 02:27:45","event1":"T21"}],"file3":[{"id":1,"time":"2095-04-22 03:48:53","event2":"P90"}]}
{"id":2,"file1":[{"id":2,"age":72,"gender":"F","wave":0}],"file2":[{"id":2,"time":"2094-05-17 18:17:25","event1":"V39"}],"file3":[{"id":2,"time":"2094-05-18 01:28:23","event2":"RT4"},{"id":2,"time":"2094-05-18 04:23:53","event2":"W3"}]}

Adding to_json does fix the undesired line feeds, but it adds \" in place of ", which the stream2es parser doesn't like:

SELECT to_json('{"id":' || a.id), to_json('"file1":' || a.tag::json), to_json('"file2":' || b.tag::json), to_json('"file3":' || c.tag::json ||'}')

"{\"id\":1","\"file1\":[{\"id\":1,\"age\":49,\"gender\":\"M\",\"wave\":1}]"...

stream2es exception: Exception in thread "stream dispatcher" java.lang.ClassCastException: java.lang.String cannot be cast to clojure.lang.IPersistentMap

Solution

Select everything as a single column (instead of four). The format() function will help you write it more clearly. Use

regexp_replace(str, '(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g')

to correct the date format and

replace(str, e' \n ', '')

to strip the newline characters.
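
For example, here is a quick sanity check on a literal fragment (a minimal sketch; the sample string is made up, but it mimics the " \n " separator and the T that the json_agg output contains):

SELECT replace(
           regexp_replace(
               '[{"time":"2095-04-20T12:28:55"}, ' || e'\n' || ' {"time":"2095-04-21T02:27:45"}]',
               '(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g'),
           e' \n ', '');
-- returns: [{"time":"2095-04-20 12:28:55"},{"time":"2095-04-21 02:27:45"}]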

Use the COPY command to simplify the task:

COPY (
    with f_1 as(
       SELECT id, json_agg(file1.*) AS tag
       FROM forum.file1
       GROUP BY id
    ), f_2 as (
       SELECT id, json_agg(file2.*) AS tag
       FROM forum.file2
       GROUP BY id
    ), f_3 as (
       SELECT id, json_agg(file3.*) AS tag
       FROM forum.file3
       GROUP BY id
    )
    SELECT
        replace(
            regexp_replace(
                format('{"id":%s,"file1":%s,"file2":%s,"file3":%s}', 
                    a.id, a.tag, b.tag, c.tag),
                '(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g'),
            e' \n ', '')
    FROM f_1 AS a, f_2 AS b, f_3 AS c
    WHERE b.id = a.id AND c.id = a.id
) TO '/full/path/to/your/file';
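
A side note on COPY ... TO '/full/path/to/your/file': it writes the file on the database server, so it needs the corresponding server-side privileges. If the file should end up on the client machine instead, psql's \copy meta-command does the same thing client-side. A minimal sketch (shown with a cut-down stand-in query, since \copy has to fit on a single line):

\copy (SELECT format('{"id":%s,"file1":%s}', id, json_agg(file1.*)) FROM forum.file1 GROUP BY id) TO 'output.json'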


To prepend each data line with a command line, you can use a trick with a function that returns two rows. Part of the formatting can be moved into the function at the same time.

create or replace function format_data_line(command text, data_str text)
returns setof text language plpgsql as $$
begin
    return next command;
    return next             
        replace(
            regexp_replace(data_str,
                '(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g'),
            e' \n ', '');
end $$;
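
As a quick check of the helper (the data string below is just a made-up sample), each call returns two rows: the command line untouched, then the cleaned data line:

SELECT format_data_line(
    'my command',
    '{"id":1,"time":"2095-04-22T03:48:53","event2":"P90"}');

-- row 1: my command
-- row 2: {"id":1,"time":"2095-04-22 03:48:53","event2":"P90"}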

COPY (
    with f_1 as(
       SELECT id, json_agg(file1.*) AS tag
       FROM forum.file1
       GROUP BY id
    ), f_2 as (
       SELECT id, json_agg(file2.*) AS tag
       FROM forum.file2
       GROUP BY id
    ), f_3 as (
       SELECT id, json_agg(file3.*) AS tag
       FROM forum.file3
       GROUP BY id
    )
    SELECT 
        format_data_line(
            'my command', 
            format('{"id":%s,"file1":%s,"file2":%s,"file3":%s}', 
                a.id, a.tag, b.tag, c.tag))
    FROM f_1 AS a, f_2 AS b, f_3 AS c
    WHERE b.id = a.id AND c.id = a.id
) TO '/full/path/to/your/file';
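
What goes into 'my command' depends on how the file is fed to Elasticsearch. The question uses stream2es, but if the file were aimed at the _bulk endpoint instead (an assumption, not something stated above), the command line would be the bulk action metadata, so the exported file would alternate action and document lines along these lines (the index and type names here are hypothetical):

{"index":{"_index":"forum","_type":"person"}}
{"id":1,"file1":[{"id":1,"age":49,"gender":"M","wave":1}],"file2":[...],"file3":[...]}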
