AWS Glue: How to handle nested JSON with varying schemas

Problem Description

Objective: We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse via Redshift Spectrum.

Background: The JSON data is from DynamoDB Streams and is deeply nested. The first level of JSON has a consistent set of elements: Keys, NewImage, OldImage, SequenceNumber, ApproximateCreationDateTime, SizeBytes, and EventName. The only variation is that some records do not have a NewImage and some don't have an OldImage. Below this first level, though, the schema varies widely.
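
For illustration, a record of this shape might look roughly like the following (a hypothetical, abridged sketch; the field values are invented, but the top-level keys match what we see):

# Hypothetical, abridged DynamoDB Streams record: the top-level keys are
# consistent, but everything inside Keys/NewImage/OldImage varies by record
record = {
    "Keys": {"Id": {"N": "101"}},
    "NewImage": {                       # deeply nested; schema differs per record
        "Id": {"N": "101"},
        "Message": {"S": "New item!"}
    },
    "OldImage": {"Id": {"N": "101"}},   # absent on some records
    "SequenceNumber": "111",
    "ApproximateCreationDateTime": 1480000000,
    "SizeBytes": 26,
    "EventName": "MODIFY"
}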

Ideally, we would like to use Glue to only parse this first level of JSON, and basically treat the lower levels as large STRING objects (which we would then parse as needed with Redshift Spectrum). Currently, we're loading the entire record into a single VARCHAR column in Redshift, but the records are nearing the maximum size for a data type in Redshift (maximum VARCHAR length is 65535). As a result, we'd like to perform this first level of parsing before the records hit Redshift.

What we have tried/referenced so far:

  • Pointing the AWS Glue Crawler to the S3 bucket results in hundreds of tables with a consistent top-level schema (the attributes listed above), but varying schemas at deeper levels in the STRUCT elements. We have not found a way to create a Glue ETL job that would read from all of these tables and load them into a single table.
  • Creating a table manually has not been fruitful. We tried setting each column to a STRING data type, but the job did not succeed in loading data (presumably since this would involve some conversion from STRUCTs to STRINGs). When setting columns to STRUCT, it requires a defined schema - but this is precisely what varies from one record to another, so we are not able to provide a generic STRUCT schema that works for all the records in question.
  • The AWS Glue Relationalize transform is intriguing, but not what we're looking for in this scenario, since we want to keep some of the JSON intact rather than flattening it entirely (a minimal sketch of the transform follows this list). Redshift Spectrum supports scalar JSON data as of a couple of weeks ago, but this does not work with the nested JSON we're dealing with. Neither of these appears to help with handling the hundreds of tables created by the Glue Crawler.
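
For reference, here is a minimal sketch of the Relationalize usage we ruled out, assuming a glueContext set up as in the answer below and a placeholder staging path; it produces one flattened frame per nested level, which is the opposite of what we want:

from awsglue.transforms import Relationalize

# Relationalize flattens every nested structure, emitting a collection with
# one DynamicFrame per nested level: nothing below the top level survives intact
flattened_collection = Relationalize.apply(
    frame=raw_df,                        # a DynamicFrame read from S3 (hypothetical name)
    staging_path="s3://my-bucket/tmp/",  # placeholder staging location
    name="root")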

Question: How would we use Glue (or some other method) to parse just the first level of these records - while ignoring the varying schemas below the top-level elements - so that we can access the data from Spectrum or load it physically into Redshift?

I'm new to Glue. I've spent quite a bit of time in the Glue documentation and looking through (the somewhat sparse) info on forums. I could be missing something obvious - or perhaps this is a limitation of Glue in its current form. Any recommendations are welcome.

Thanks!

Answer

I'm not sure you can do this with a table definition, but you can accomplish this with an ETL job by using a mapping function to cast the top-level values as JSON strings. Documentation: [link]

import json

from awsglue.context import GlueContext
from awsglue.transforms import Map
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Mapping function: re-serialize each top-level value as a JSON string,
# leaving a flat record whose schema is consistent across all records
def flatten(rec):
    for key in rec:
        rec[key] = json.dumps(rec[key])
    return rec

# Read the raw JSON from S3 into a DynamicFrame
old_df = glueContext.create_dynamic_frame.from_options(
    's3',
    {"paths": ['s3://...']},
    "json")

# Apply the mapping function to every DynamicRecord in the DynamicFrame
new_df = Map.apply(frame=old_df, f=flatten)

From here you have the option of exporting to S3 (perhaps in Parquet or some other columnar format to optimize for querying) or, from my understanding, loading it directly into Redshift, although I haven't tried that.
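
For example, writing the mapped frame back out to S3 as Parquet might look like the following (a sketch assuming the new_df produced above; the output path is a placeholder):

# Write the flattened DynamicFrame back to S3 as Parquet for Spectrum to query
# (the output path is a placeholder)
glueContext.write_dynamic_frame.from_options(
    frame=new_df,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/flattened/"},
    format="parquet")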
