如何在 AWS Athena 中自动执行 MSCK REPAIR TABLE [英] How to make MSCK REPAIR TABLE execute automatically in AWS Athena
问题描述
我有一个每小时执行一次的 Spark 批处理作业.每次运行都会使用目录命名模式 DATA/YEAR=?/MONTH=?/DATE=?/datafile
在 S3
中生成和存储新数据.
I have a Spark batch job which is executed hourly. Each run generates and stores new data in S3
with the directory naming pattern DATA/YEAR=?/MONTH=?/DATE=?/datafile
.
将数据上传到 S3
后,我想使用 Athena
对其进行调查.另外,我想通过连接到 Athena 作为数据源,在 QuickSight
中将它们可视化.
After uploading the data to S3
, I want to investigate it using Athena
. Also, I would like to visualize them in QuickSight
by connecting to Athena as a data source.
问题是每次运行我的 Spark 批处理后,存储在 S3
中的新生成的数据将不会被 Athena 发现,除非我手动运行查询 MSCK REPAIR TABLE代码>.
The problem is that after each run of my Spark batch, the newly generated data stored in S3
will not be discovered by Athena, unless I manually run the query MSCK REPAIR TABLE
.
有没有办法让 Athena 自动更新数据,这样我就可以创建一个全自动的数据可视化管道?
Is there a way to make Athena update the data automatically, so that I can create a fully automatic data visualization pipeline?
推荐答案
有多种方法可以安排此任务.您如何安排工作流程?你使用像 Airflow、Luigi、Azkaban、cron 或使用 AWS 数据管道?
There are a number of ways to schedule this task. How do you schedule your workflows? Do you use a system like Airflow, Luigi, Azkaban, cron, or using an AWS Data pipeline?
从其中任何一个中,您应该能够触发以下 CLI 命令.
From any of these, you should be able to fire off the following CLI command.
$ aws athena start-query-execution --query-string "MSCK REPAIR TABLE some_database.some_table" --result-configuration "OutputLocation=s3://SOMEPLACE"
另一种选择是 AWS Lambda.您可以有一个函数调用 MSCK REPAIR TABLE some_database.some_table
以响应新的 S3 上传.
Another option would be AWS Lambda. You could have a function that calls MSCK REPAIR TABLE some_database.some_table
in response to a new upload to S3.
一个 Lambda 函数示例可以这样写:
An example Lambda Function could be written as such:
import boto3
def lambda_handler(event, context):
bucket_name = 'some_bucket'
client = boto3.client('athena')
config = {
'OutputLocation': 's3://' + bucket_name + '/',
'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}
}
# Query Execution Parameters
sql = 'MSCK REPAIR TABLE some_database.some_table'
context = {'Database': 'some_database'}
client.start_query_execution(QueryString = sql,
QueryExecutionContext = context,
ResultConfiguration = config)
然后,您将配置触发器以在存储桶中的 DATA/
前缀下添加新数据时执行您的 Lambda 函数.
You would then configure a trigger to execute your Lambda function when new data are added under the DATA/
prefix in your bucket.
最终,在使用作业调度程序运行 Spark 作业后显式重建分区具有自我记录的优势.另一方面,AWS Lambda 对于这样的工作很方便.
Ultimately, explicitly rebuilding the partitions after you run your Spark Job using a job scheduler has the advantage of being self documenting. On the other hand, AWS Lambda is convenient for jobs like this one.
这篇关于如何在 AWS Athena 中自动执行 MSCK REPAIR TABLE的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!