Redshift COPY using JSONPath for missing array/fields

Problem description

I am using the COPY command to load a JSON dataset from S3 into a Redshift table. The data is only partially loaded: COPY skips records that have missing data (a missing key-value pair or array), i.e. from the example below only the first record gets loaded.

Query:

COPY address from 's3://mybucket/address.json'
credentials 'aws_access_key_id=XXXXXXX;aws_secret_access_key=XXXXXXX' maxerror as 250
json 's3://mybucket/address_jsonpath.json';

My question is: how can I load all the records from address.json even when some records have missing keys or data, as in the sample data set below?

Sample JSON

{
  "name": "Sam P",
  "addresses": [
    {
      "zip": "12345",
      "city": "Silver Spring",
      "street_address": "2960 Silver Ave",
      "state": "MD"
    },
    {
      "zip": "99999",
      "city": "Curry",
      "street_address": "2960 Silver Ave",
      "state": "PA"
    }
  ]
}
{
  "name": "Sam Q",
  "addresses": [ ]
}
{
  "name": "Sam R"
}

Is there an alternative to FILLRECORD for a JSON dataset?

I am looking for an implementation or a workaround that can load all 3 of the above records into the Redshift table.

Recommended answer

There is no FILLRECORD equivalent for COPY from JSON. The documentation explicitly states that it is not supported.

But you have a more fundamental issue: the first record contains an array of multiple addresses. Redshift's COPY from JSON does not allow you to create multiple rows from nested arrays.

The simplest way to resolve this is to define the files to be loaded as an external table and use Redshift Spectrum's nested data syntax to expand the embedded array into full rows. Then use an INSERT INTO to load the data into a final table.

DROP TABLE IF EXISTS spectrum.partial_json;
CREATE EXTERNAL TABLE spectrum.partial_json (
  name       VARCHAR(100),
  addresses  ARRAY<STRUCT<zip:INTEGER
                         ,city:VARCHAR(100)
                         ,street_address:VARCHAR(255)
                         ,state:VARCHAR(2)>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-test-files/partial_json/'
;

INSERT INTO final_table 
SELECT ext.name
     , address.zip
     , address.city
     , address.street_address
     , address.state
FROM spectrum.partial_json ext
LEFT JOIN ext.addresses address ON true
;
--  name  |  zip  |     city      | street_address  | state
-- -------+-------+---------------+-----------------+-------
--  Sam P | 12345 | Silver Spring | 2960 Silver Ave | MD
--  Sam P | 99999 | Curry         | 2960 Silver Ave | PA
--  Sam Q |       |               |                 |
--  Sam R |       |               |                 |

NB: I tweaked your example JSON a little to make this simpler. For instance, you had un-keyed objects as the values for name, which I changed into plain string values.
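
To see why the LEFT JOIN in the query above keeps "Sam Q" and "Sam R", it can help to mimic the expansion locally. The following is only a rough Python sketch of the same row-expansion idea, not anything Redshift runs: the helper names iter_json_objects and expand_addresses are made up for illustration, and the sample data is the tweaked JSON from the question (zip stays a string here, whereas the external table declares it as INTEGER).

```python
import json

# Sample records from the question: three JSON objects written back to back.
RAW = """
{
  "name": "Sam P",
  "addresses": [
    {"zip": "12345", "city": "Silver Spring",
     "street_address": "2960 Silver Ave", "state": "MD"},
    {"zip": "99999", "city": "Curry",
     "street_address": "2960 Silver Ave", "state": "PA"}
  ]
}
{
  "name": "Sam Q",
  "addresses": [ ]
}
{
  "name": "Sam R"
}
"""


def iter_json_objects(text):
    """Yield each top-level JSON object from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def expand_addresses(record):
    """Emit one row per address; a missing or empty array still yields one row.

    This mirrors the effect of LEFT JOIN ... ON true: the parent record is
    kept and the address columns come back as NULL (None here).
    """
    addresses = record.get("addresses") or [{}]
    for address in addresses:
        yield (
            record.get("name"),
            address.get("zip"),
            address.get("city"),
            address.get("street_address"),
            address.get("state"),
        )


rows = [row for rec in iter_json_objects(RAW) for row in expand_addresses(rec)]

for row in rows:
    print(row)
```

The sketch produces four rows, two for "Sam P" and one each for "Sam Q" and "Sam R" with empty address fields, which matches the result set shown in the SQL comments. An inner join would correspond to dropping the `or [{}]` fallback, and would lose the last two records.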
