How to create an AWS Athena table via a Glue crawler when the S3 data store has both json and .gz compressed files?


Question


I have two problems in my intended solution:


1. My S3 store is structured as follows:

mainfolder/date=2019-01-01/hour=14/abcd.json
mainfolder/date=2019-01-01/hour=13/abcd2.json.gz
...
mainfolder/date=2019-01-15/hour=13/abcd74.json.gz


All JSON files have the same schema, and I want to point a crawler at mainfolder/ so that it creates a table I can query in Athena.


I have already tried with just one file format: if the files are all json or all gz, the crawler works perfectly. What I am looking for is a solution that handles both file types automatically. I am open to writing a custom script or using an out-of-the-box solution, but I need pointers on where to start.


2. The second issue is that my JSON data has a field (column) that the crawler interprets as struct, but I want that field's type to be string. The reason is that if the type remains struct, the date/hour partitions get a mismatch error, because the struct does not have the same internal schema across the files. I have tried to create a custom classifier, but it offers no option to specify data types.

Recommended answer


I would suggest skipping the crawler altogether. In my experience, Glue crawlers are not worth the problems they cause. It's easy to create tables with the Glue API, and so is adding partitions. The API is a bit verbose, especially for adding partitions, but it's much less painful than trying to make a crawler do what you want it to do.
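
As a rough illustration of the Glue API route, here is a minimal Python sketch using boto3's `create_table` and `batch_create_partition`. The bucket name, the database and table names, and the two columns (`id`, `payload`) are hypothetical, since the question does not show the real schema. The sketch assumes the OpenX JSON SerDe, which as far as I know returns a nested object as its JSON text when the column is declared as string; that would sidestep the struct mismatch, but verify it against your data. Athena picks up gzip compression from the .gz file extension, so plain and compressed files can sit under the same prefix.

```python
from copy import deepcopy

# Hypothetical values: the question gives no bucket name or schema.
BUCKET_PREFIX = "s3://example-bucket/mainfolder"
COLUMNS = [
    {"Name": "id", "Type": "string"},
    # The field the crawler inferred as struct, declared as string instead.
    {"Name": "payload", "Type": "string"},
]


def storage_descriptor(location):
    """Build a Glue StorageDescriptor for line-delimited JSON at `location`."""
    return {
        "Columns": deepcopy(COLUMNS),
        "Location": location,
        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
    }


def table_input(name):
    """Build the TableInput argument for glue.create_table."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "json"},
        "PartitionKeys": [
            {"Name": "date", "Type": "string"},
            {"Name": "hour", "Type": "string"},
        ],
        "StorageDescriptor": storage_descriptor(BUCKET_PREFIX + "/"),
    }


def partition_input(date, hour):
    """Build one PartitionInput entry for glue.batch_create_partition."""
    location = f"{BUCKET_PREFIX}/date={date}/hour={hour}/"
    return {"Values": [date, hour], "StorageDescriptor": storage_descriptor(location)}


def create_table_and_partitions(database, name, partitions):
    """Create the table, then register (date, hour) partitions in batches."""
    import boto3  # imported here so the builders above work without AWS access

    glue = boto3.client("glue")
    glue.create_table(DatabaseName=database, TableInput=table_input(name))
    inputs = [partition_input(d, h) for d, h in partitions]
    # batch_create_partition accepts at most 100 partitions per call.
    for i in range(0, len(inputs), 100):
        glue.batch_create_partition(
            DatabaseName=database,
            TableName=name,
            PartitionInputList=inputs[i : i + 100],
        )
```

A call such as `create_table_and_partitions("my_db", "events", [("2019-01-01", "14"), ("2019-01-01", "13")])` would register the first two folders from the question; the same partition step can be scheduled to run as new date/hour folders arrive.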


You can of course also create the table from Athena; that way you can be sure you get a table that works with Athena (otherwise there are some details you need to get right). Adding partitions with SQL through Athena is also less verbose, but slower.
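
The Athena route could look roughly like the Python sketch below, which builds the DDL strings and submits them with boto3's `start_query_execution`. The table name, columns, and S3 locations are hypothetical, and the OpenX JSON SerDe is again an assumption rather than something the answer specifies. `date` is quoted with backticks because it is a reserved word in Athena DDL. Because the folders follow the `date=.../hour=...` naming, no explicit partition LOCATION should be needed.

```python
def create_table_ddl(table, location):
    """Return a CREATE EXTERNAL TABLE statement for JSON data at `location`."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        "  id string,\n"
        "  payload string\n"  # hypothetical column, string instead of struct
        ")\n"
        "PARTITIONED BY (`date` string, `hour` string)\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )


def add_partitions_ddl(table, partitions):
    """Return one ALTER TABLE statement adding all (date, hour) partitions."""
    clauses = " ".join(
        f"PARTITION (`date` = '{d}', `hour` = '{h}')" for d, h in partitions
    )
    return f"ALTER TABLE {table} ADD IF NOT EXISTS {clauses}"


def run_in_athena(sql, database, output_location):
    """Submit a statement to Athena and return its query execution id."""
    import boto3  # imported here so the builders above work without AWS access

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Each statement runs as an asynchronous Athena query, which is one reason this path is slower than calling the Glue API directly.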
