AWS Glue Crawlers and large tables stored in S3


Problem description

I have some general questions about AWS Glue and its crawlers. I have some data streams into S3 buckets, and I use AWS Athena to access them as external tables in Redshift. The tables are partitioned by hour, and some Glue crawlers update the partitions and the table structure every hour.

The problem is that the crawlers take longer and longer, and someday they will not finish in less than an hour. Is there some setting to speed up this process, or a proper alternative to the crawlers in AWS Glue?

Recommended answer

Unfortunately, there are no configuration options for Glue Crawlers to tune performance. However, as far as I know, the AWS Glue team should release a feature that significantly improves crawler performance (I don't know the date, though).

In general, there are a few ways to register new partitions in the Data Catalog (a sketch of the Athena-based options follows the list):


  1. Run a Glue Crawler
  2. Run an MSCK REPAIR TABLE <table> query in Athena
  3. Add the partition via Athena
  4. Add the partition via the Glue API
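
For options 2 and 3, the queries can be submitted programmatically through the Athena API. Below is a minimal sketch using boto3; the database, table, partition values, and result bucket are hypothetical placeholders.

```python
import boto3

athena = boto3.client("athena")

def run_athena_query(query: str) -> str:
    """Submit a query to Athena and return its execution id."""
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "my_database"},    # hypothetical database
        ResultConfiguration={
            "OutputLocation": "s3://my-athena-results/"       # hypothetical results bucket
        },
    )
    return response["QueryExecutionId"]

# Option 2: rescan the whole table location for new partitions.
# This gets slower as the number of partitions grows.
run_athena_query("MSCK REPAIR TABLE my_events")               # hypothetical table

# Option 3: register a single new hourly partition explicitly.
# Only the one new prefix is touched, so it stays fast.
run_athena_query(
    "ALTER TABLE my_events ADD IF NOT EXISTS "
    "PARTITION (year='2019', month='01', day='15', hour='10') "
    "LOCATION 's3://my-data-bucket/events/2019/01/15/10/'"    # hypothetical layout
)
```

Note that MSCK REPAIR TABLE scans the entire table location, so on a heavily partitioned table it can run into the same slowness as a crawler; ALTER TABLE ADD PARTITION registers only the one new partition.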

The most efficient way is to add partitions manually (option 3 or 4). So if you know when and which new partitions should be registered, you can set up a Lambda function that calls Athena or the Glue API. The Lambda itself might be triggered by an SNS notification or a CloudWatch event, as in the sketch below.
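
Here is a minimal sketch of option 4: a Lambda handler that registers one new hourly partition through the Glue API. It reuses the table's storage descriptor so the partition matches the table definition; all names and the S3 key layout are hypothetical.

```python
import boto3

glue = boto3.client("glue")

DATABASE = "my_database"     # hypothetical database name
TABLE = "my_events"          # hypothetical table name
BUCKET = "my-data-bucket"    # hypothetical data bucket

def lambda_handler(event, context):
    # Partition values for the hour to register; in practice these would
    # be derived from the triggering event, hard-coded here for illustration.
    values = ["2019", "01", "15", "10"]    # year, month, day, hour

    # Copy the table's storage descriptor (format, SerDe, columns) and
    # point it at the new partition's S3 prefix.
    table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
    descriptor = dict(table["StorageDescriptor"])
    descriptor["Location"] = "s3://{}/events/{}/".format(BUCKET, "/".join(values))

    # create_partition raises AlreadyExistsException if the partition is
    # already registered, which makes accidental re-runs easy to detect.
    glue.create_partition(
        DatabaseName=DATABASE,
        TableName=TABLE,
        PartitionInput={
            "Values": values,
            "StorageDescriptor": descriptor,
        },
    )
```

Since the partitions are hourly, the simplest trigger is a CloudWatch Events (EventBridge) schedule that fires once per hour, shortly after the new data lands.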

