AWS Glue Crawler Creates Partition and File Tables


Problem Description

I have a pretty basic S3 setup that I would like to query using Athena. The data is all stored in one bucket, organized into year/month/day/hour folders.

|--data
|   |--2018
|   |   |--01
|   |   |   |--01
|   |   |   |   |--01
|   |   |   |   |   |--file1.json
|   |   |   |   |   |--file2.json
|   |   |   |   |--02
|   |   |   |   |   |--file3.json
|   |   |   |   |   |--file4.json
...

I then set up an AWS Glue Crawler to crawl s3://bucket/data. The schema in all files is identical. I would expect to get one database table, with partitions on year, month, day, etc.

What I get instead are tens of thousands of tables: a table for each file, and a table for each parent partition as well. As far as I can tell, separate tables were created for each file/folder, with no single overarching table I can query across a large date range.

I followed the instructions at https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html to the best of my ability, but cannot figure out how to structure my partitions/scanning so that I don't end up with this huge, mostly worthless dump of data.

Recommended Answer

Glue Crawler leaves a lot to be desired. It promises to handle a lot of situations, but is really limited in what it actually supports. If your data is stored in directories and does not use Hive-style partitioning (e.g. year=2019/month=02/file.json), it will more often than not mess up. It's especially frustrating when the data is produced by another AWS product, such as Kinesis Firehose, which is what your data looks like it could be.

Depending on how much data you have, I might start by just creating an unpartitioned Athena table that points to the root of the structure. Partitioning only becomes important once your data grows beyond multiple gigabytes or thousands of files.
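As a rough sketch, such an unpartitioned table could look like the DDL below. The table name, column names, and the JSON SerDe are assumptions; replace them with your actual schema. Athena reads all objects under LOCATION recursively, so the year/month/day/hour subfolders are all picked up.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  -- hypothetical columns; replace with the fields in your JSON files
  id string,
  payload string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket/data/';
```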

Another strategy you could employ is to add a Lambda function that is triggered by an S3 notification whenever a new object lands in your bucket. The function can look at the key, figure out which partition it belongs to, and use the Glue API to add that partition to the table. Adding a partition that already exists returns an error from the API, but as long as your function catches and ignores it, you will be fine.
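A minimal sketch of that Lambda, assuming the key layout shown above; the database name, table name, and bucket are placeholders, and in practice you would copy the full StorageDescriptor from the table definition:

```python
import re

# Matches keys like "data/2018/01/01/02/file3.json"
KEY_PATTERN = re.compile(r"^data/(\d{4})/(\d{2})/(\d{2})/(\d{2})/[^/]+$")

def partition_values(key):
    """Extract (year, month, day, hour) from an S3 key, or None if it doesn't match."""
    m = KEY_PATTERN.match(key)
    return m.groups() if m else None

def lambda_handler(event, context):
    import boto3  # imported here so partition_values stays testable without AWS
    from botocore.exceptions import ClientError

    glue = boto3.client("glue")
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        values = partition_values(key)
        if values is None:
            continue
        year, month, day, hour = values
        try:
            glue.create_partition(
                DatabaseName="mydb",    # assumed database name
                TableName="events",     # assumed table name
                PartitionInput={
                    "Values": [year, month, day, hour],
                    "StorageDescriptor": {
                        "Location": f"s3://bucket/data/{year}/{month}/{day}/{hour}/",
                        # InputFormat and SerDe omitted for brevity; copy them
                        # from the table's own StorageDescriptor in practice.
                    },
                },
            )
        except ClientError as e:
            # The partition may already exist; that's fine.
            if e.response["Error"]["Code"] != "AlreadyExistsException":
                raise
```

Wiring the bucket's `s3:ObjectCreated:*` notification to this function keeps the partition list current without rerunning a crawler.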

