Querying optional nested JSON fields in Athena

Question

I have JSON data that looks something like:

{ "col1" : 123, "metadata" : { "opt1" : 456, "opt2" : 789 } }

where the various metadata fields (of which there are many) are optional and may or may not be present.

My query is:

select col1, metadata.opt1 from "db-name".tablename

If opt1 is not present in any row, I would expect this to return all rows with a blank opt1 column. But if no row had opt1 in metadata when the crawler ran (and, since it's optional, it might still be absent from the data when the query runs), the query fails with:

SYNTAX_ERROR: line 2:1: Column '"metadata"."opt1"' cannot be resolved

I could specify these fields manually in the schema definition (if I don't use a crawler), but then it wouldn't pick up any new metadata fields that may arrive, and specifying a static schema doesn't seem to be in the spirit of how Athena is supposed to work.

How do I get this to function as expected (preferably without putting dummy rows in or customizing the SerDe)?

Using SerDe org.openx.data.jsonserde.JsonSerDe at present.

Any ideas appreciated.

Recommended answer

It might not be what you want to hear, but I advise you not to use Glue Crawler. This is just the tip of the iceberg of the problems it creates when your use case doesn't fit exactly with the use cases it was designed for (see for example this question, this question, this question, this question, or this question).

Instead, create the table manually, starting from whatever Glue Crawler created for you when it worked (you can get the DDL for a table with SHOW CREATE TABLE foo in Athena). Then add partitions manually with ALTER TABLE foo ADD PARTITION.
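A minimal sketch of what such a hand-written table definition might look like for the data in the question. The bucket name, the S3 prefix, and the `dt` partition column are assumptions made for illustration; none of them appear in the original question. Each statement would be run as a separate Athena query, in the database that should hold the table.

```sql
-- Inspect the DDL the crawler produced, as a starting point.
SHOW CREATE TABLE tablename;

-- Hand-written replacement: every known optional field is declared
-- inside the struct. Rows that lack a field should return NULL for it.
-- The LOCATION and the dt partition column are hypothetical.
CREATE EXTERNAL TABLE tablename (
  col1 int,
  metadata struct<opt1:int,opt2:int>
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-bucket/tablename/';
```

With the optional fields declared up front, the query from the question (`select col1, metadata.opt1 ...`) can resolve `metadata.opt1` even when no row currently contains it.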

Keeping the table up to date with optional fields is going to be complicated, whatever method you use. If you only ever add fields, you can update the table's columns when you add a new partition that has more columns (if you do it with Athena, update the columns before you add the partition). Another way is to simply type the metadata column as STRING and use JSON functions to extract the properties in your queries (see for example this question/answer).
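The STRING approach could be sketched as follows. This assumes the OpenX JSON SerDe hands back the nested object as raw JSON text when the column is declared as `string`, which is worth verifying against your own data. The table name `tablename_raw` and the S3 location are hypothetical.

```sql
-- Declare metadata as a plain string so the table schema never has to
-- change when new optional fields appear. Location is hypothetical.
CREATE EXTERNAL TABLE tablename_raw (
  col1 int,
  metadata string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-bucket/tablename/';

-- Extract optional fields at query time. json_extract_scalar returns
-- NULL when the path is absent, so missing fields do not fail the query.
SELECT
  col1,
  CAST(json_extract_scalar(metadata, '$.opt1') AS integer) AS opt1
FROM tablename_raw;
```

The trade-off is that field names and types are no longer checked by the schema; a typo in a JSON path silently yields NULL instead of an error.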

I assume you're using Glue Crawler to add partitions periodically. If you're in control of the process that adds data to S3, I suggest you add code there that runs an ALTER TABLE … ADD PARTITION (or uses CreatePartition in the Glue API).

If you're not in control of that code, or it would be very inconvenient, you can solve the problem with Lambda. If you, for example, only partition by time, you can run it once per day and add the next day's partition (there doesn't have to be any data on S3; you can add partitions that don't yet contain data, since a partition is just metadata). If it's more complex than that, you can trigger the Lambda function to run when new files are created on S3 and add the partitions as a reaction.
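As a rough illustration, the statement a daily Lambda function might submit to Athena could look like the one below. The `dt` partition column, the date value, and the S3 path are assumptions for the sake of the example; the function would substitute the next day's date.

```sql
-- Register tomorrow's partition ahead of time. IF NOT EXISTS makes the
-- statement safe to re-run. The S3 prefix may still be empty, since a
-- partition is only metadata. Column name and path are hypothetical.
ALTER TABLE tablename ADD IF NOT EXISTS
  PARTITION (dt = '2024-01-02')
  LOCATION 's3://example-bucket/tablename/dt=2024-01-02/';
```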

This might sound more complicated than using Glue Crawler, and if Glue Crawlers actually worked as you expect them to, it would be. Since they don't really work very well, doing it manually is going to be a lot less work.
