AWS Glue Crawler Creates Partition and File Tables

Question

I have a pretty basic s3 setup that I would like to query against using Athena. The data is all stored in one bucket, organized into year/month/day/hour folders.

|--data
|   |--2018
|   |   |--01
|   |   |   |--01
|   |   |   |   |--01
|   |   |   |   |   |--file1.json
|   |   |   |   |   |--file2.json
|   |   |   |   |--02
|   |   |   |   |   |--file3.json
|   |   |   |   |   |--file4.json
...

I then setup an AWS Glue Crawler to crawl s3://bucket/data. The schema in all files is identical. I would expect that I would get one database table, with partitions on the year, month, day, etc.

What I get instead are tens of thousands of tables. There is a table for each file, and a table for each parent partition as well. So far as I can tell, separate tables were created for each file/folder, without a single overarching one where I can query across a large date range.

I followed the instructions at https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html to the best of my ability, but cannot figure out how to structure my partitions/scanning such that I don't get this huge, mostly worthless dump of data.

Answer

Glue Crawler leaves a lot to be desired. It promises to solve a lot of situations, but is really limited in what it actually supports. If your data is stored in directories and does not use Hive-style partitioning (e.g. year=2019/month=02/file.json), it will more often than not mess up. It's especially frustrating when the data is produced by other AWS products, like Kinesis Firehose, which it looks like your data could be.

Depending on how much data you have, I might start by just creating an unpartitioned Athena table that points to the root of the structure. It's only once your data grows beyond multiple gigabytes or thousands of files that partitioning becomes important.
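
A minimal sketch of what that could look like, assuming hypothetical names throughout (`events_db.events`, an `id`/`payload` JSON schema, and made-up bucket paths). Athena scans everything under `LOCATION` recursively, so one table at the root covers all the year/month/day/hour folders:

```python
import boto3

athena = boto3.client("athena")

# All names below are hypothetical -- substitute your own database,
# table, bucket, and column definitions to match your JSON schema.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS events_db.events (
  id string,
  payload string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket/data/'
"""

# Run the DDL as a regular Athena query; results metadata goes to the
# output location like any other query.
athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
)
```

You could just as well paste the DDL straight into the Athena console; the point is that no crawler is involved at all.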

Another strategy you could employ is to add a Lambda function that gets triggered by an S3 notification whenever a new object lands in your bucket. The function could look at the key and figure out which partition it belongs to and use the Glue API to add that partition to the table. Adding a partition that already exists will return an error from the API, but as long as your function catches it and ignores it you will be fine.
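
A sketch of such a Lambda handler, again with hypothetical names, assuming the table above has been given year/month/day/hour partition keys and the function is wired to S3 `ObjectCreated` notifications:

```python
import boto3
from urllib.parse import unquote_plus

glue = boto3.client("glue")

# Hypothetical names -- replace with your own database and table.
DATABASE = "events_db"
TABLE = "events"

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # e.g. data/2018/01/01/01/file1.json
        key = unquote_plus(record["s3"]["object"]["key"])
        # Derive the partition values from the key's folder structure.
        year, month, day, hour = key.split("/")[1:5]
        location = f"s3://{bucket}/data/{year}/{month}/{day}/{hour}/"
        try:
            glue.create_partition(
                DatabaseName=DATABASE,
                TableName=TABLE,
                PartitionInput={
                    "Values": [year, month, day, hour],
                    "StorageDescriptor": {
                        "Location": location,
                        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                        "OutputFormat": "org.apache.hadoop.hive.ql.io."
                                        "HiveIgnoreKeyTextOutputFormat",
                        "SerdeInfo": {
                            "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
                        },
                    },
                },
            )
        except glue.exceptions.AlreadyExistsException:
            # Partition already registered; safe to ignore, as noted above.
            pass
```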
