Querying optional nested JSON fields in Athena

Question

I have json data that looks something like:

{ "col1" : 123, "metadata" : { "opt1" : 456, "opt2" : 789 } }

where the various metadata fields (of which there are many) are optional and may or may not be present.

My query is:

select col1, metadata.opt1 from "db-name".tablename

If opt1 is not present in any rows, I would expect this to return all rows with a blank for the opt1 column, but if there wasn't a row with the opt1 in metadata when the crawler ran (and might still not be present in data when the query is run, as it's optional), the query fails, with:

SYNTAX_ERROR: line 2:1: Column '"metadata"."opt1"' cannot be resolved

I could specify these fields manually in the schema definition (if I don't use a crawler), but then it wouldn't pick up any new metadata fields that may arrive, and specifying a static schema doesn't seem to be in the spirit of how Athena is supposed to work.

How do I get this to function as expected (preferably without putting dummy rows in or customizing the SerDe)?

Using SerDe org.openx.data.jsonserde.JsonSerDe at present.

Thanks for any ideas.

Recommended answer

It might not be what you want to hear, but I advise you to not use Glue Crawler. This is just the tip of the iceberg of the problems it creates when your use case doesn't fit exactly with the use cases it was designed for (see for example this question, this question, this question, this question, or this question).

Instead, create the table manually using whatever Glue Crawler created for you when it worked (you can get the DDL for a table with SHOW CREATE TABLE foo in Athena). Then add partitions manually with ALTER TABLE foo ADD PARTITION.

Keeping the table up to date with optional fields is going to be complicated, whatever method you use. If you only ever add fields, you can update the table's columns when you add a new partition that has more columns (if you do it with Athena, do it before you add the partition). Another way would be to simply type the metadata column as STRING and use JSON functions to extract the properties in your queries (see for example this question/answer).
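To make the STRING approach concrete, here is a minimal sketch in Python that builds such a query. It assumes the table has been recreated with metadata declared as STRING, and the helper name optional_field_query is made up for illustration. json_extract_scalar is one of Athena's JSON functions and returns NULL when the path is missing, which is the behaviour the question asks for.

```python
def optional_field_query(table: str, fields: list[str]) -> str:
    """Build an Athena query that reads optional keys from a STRING `metadata` column.

    Hypothetical helper: assumes `metadata` is typed as STRING in the table
    definition and that every name in `fields` is a simple identifier.
    """
    columns = ["col1"]
    for name in fields:
        # json_extract_scalar yields NULL when the key is absent from a row,
        # so a missing optional field no longer fails the whole query.
        columns.append(f"json_extract_scalar(metadata, '$.{name}') AS {name}")
    return f"SELECT {', '.join(columns)} FROM {table}"


if __name__ == "__main__":
    print(optional_field_query('"db-name".tablename', ["opt1", "opt2"]))
```

The extracted values come back as strings, so cast them in the query if you need numbers.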

I assume you're using Glue Crawler to add partitions periodically. If you're in control of the process that adds data to S3, I suggest you add code there that runs an ALTER TABLE … ADD PARTITION (or uses CreatePartition in the Glue API).
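As a rough sketch of the Glue API route, the Python below copies the table's storage descriptor and points it at the new partition's location, which is one common way to fill in PartitionInput. The function names and the example bucket are assumptions for illustration; get_table and create_partition are the boto3 Glue client calls.

```python
import copy


def build_partition_input(table: dict, values: list[str], location: str) -> dict:
    """Return a PartitionInput dict for Glue's CreatePartition.

    `table` is the "Table" dict returned by Glue's GetTable. Its storage
    settings (SerDe, formats, columns) are reused for the partition, and only
    the location is changed.
    """
    descriptor = copy.deepcopy(table["StorageDescriptor"])
    descriptor["Location"] = location
    return {"Values": values, "StorageDescriptor": descriptor}


def add_partition(database: str, table_name: str, values: list[str], location: str) -> None:
    """Register one partition in the Glue catalog (hypothetical wrapper)."""
    import boto3  # imported here so the pure helper above works without AWS access

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.create_partition(
        DatabaseName=database,
        TableName=table_name,
        PartitionInput=build_partition_input(table, values, location),
    )
```

A call might look like add_partition("db-name", "tablename", ["2024-01-02"], "s3://example-bucket/data/dt=2024-01-02/"), where the partition value and S3 path are placeholders.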

If you're not in control of that code, or it would be very inconvenient, you can solve the problem with Lambda. If you, for example, only partition by time, you can run it once per day and add the next day's partition (there doesn't have to be any data on S3, you can add partitions that don't yet contain data, it's just metadata). If it's more complex than that you can trigger the Lambda function to run when new files are created on S3 and add the partitions as a reaction.
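The once-per-day variant could look something like the following Python sketch. The partition key dt, the bucket paths, and the function names are all assumptions for illustration, not something taken from the question's table. The handler submits the DDL through Athena's start_query_execution.

```python
from datetime import date, timedelta


def next_day_partition_ddl(table: str, base_location: str, today: date) -> str:
    """Build the DDL that adds tomorrow's partition.

    Assumes a single, hypothetical partition key `dt` holding an ISO date and
    a `base_location` that ends with a slash.
    """
    day = (today + timedelta(days=1)).isoformat()
    # IF NOT EXISTS makes the statement safe to run more than once.
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt = '{day}') LOCATION '{base_location}dt={day}/'"
    )


def handler(event, context):
    """Lambda entry point, meant to be triggered by a daily schedule."""
    import boto3  # imported here so the DDL builder can be used without AWS access

    ddl = next_day_partition_ddl("tablename", "s3://example-bucket/data/", date.today())
    boto3.client("athena").start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "db-name"},
        # Placeholder location for Athena's query result files.
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )
```

Because the partition is only metadata, the statement succeeds even when nothing has been written under that prefix yet.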

This might sound more complicated than using Glue Crawler, and if Glue Crawlers actually worked as you expect them to it would be. Since they don't really work very well, it's going to be a lot less work.
