How to make MSCK REPAIR TABLE execute automatically in AWS Athena


Problem Description

I have a Spark batch job which is executed hourly. Each run generates and stores new data in S3 with the directory naming pattern DATA/YEAR=?/MONTH=?/DATE=?/datafile.

After uploading the data to S3, I want to investigate it using Athena. I would also like to visualize it in QuickSight, connecting to Athena as a data source.

The problem is that after each run of my Spark batch, the newly generated data stored in S3 will not be discovered by Athena, unless I manually run the query MSCK REPAIR TABLE.

Is there a way to make Athena update the data automatically, so that I can create a fully automatic data visualization pipeline?

Recommended Answer

There are a number of ways to schedule this task. How do you schedule your workflows? Do you use a system like Airflow, Luigi, Azkaban, cron, or AWS Data Pipeline?

From any of these, you should be able to fire off the following CLI command.

$ aws athena start-query-execution --query-string "MSCK REPAIR TABLE some_database.some_table" --result-configuration "OutputLocation=s3://SOMEPLACE"
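If the scheduled task runs the CLI from a script, the invocation can be assembled programmatically before handing it to the shell. A minimal sketch in Python (the database, table, and output location shown are placeholders, not names from any real account):

```python
def build_repair_command(database, table, output_location):
    """Assemble the aws-cli argument list that kicks off MSCK REPAIR TABLE.

    Placeholder names; substitute your own database, table, and S3 bucket.
    """
    query = f"MSCK REPAIR TABLE {database}.{table}"
    return [
        "aws", "athena", "start-query-execution",
        "--query-string", query,
        "--result-configuration", f"OutputLocation={output_location}",
    ]

# A cron/Airflow task could then execute it, e.g.:
#   import subprocess
#   subprocess.run(build_repair_command("some_database", "some_table",
#                                       "s3://SOMEPLACE"), check=True)
```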

Another option would be AWS Lambda. You could have a function that calls MSCK REPAIR TABLE some_database.some_table in response to a new upload to S3.

An example Lambda Function could be written as such:

import boto3

def lambda_handler(event, context):
    bucket_name = 'some_bucket'

    client = boto3.client('athena')

    # Where Athena writes query results; encrypted at rest with SSE-S3
    config = {
        'OutputLocation': 's3://' + bucket_name + '/',
        'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}
    }

    # Query Execution Parameters
    sql = 'MSCK REPAIR TABLE some_database.some_table'
    # Named query_context so it does not shadow the Lambda `context` argument
    query_context = {'Database': 'some_database'}

    client.start_query_execution(QueryString=sql,
                                 QueryExecutionContext=query_context,
                                 ResultConfiguration=config)
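Inside the handler, the triggering bucket and object key can be read from the S3 event, for example to log what arrived or to decide which table to repair. A small sketch, assuming the standard S3 notification event structure (the helper name is illustrative):

```python
def extract_s3_object(event):
    """Return (bucket, key) from the first record of an S3 notification event."""
    record = event["Records"][0]
    return record["s3"]["bucket"]["name"], record["s3"]["object"]["key"]
```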

You would then configure a trigger to execute your Lambda function when new data are added under the DATA/ prefix in your bucket.
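That trigger is an S3 event notification scoped to the DATA/ prefix. One way to express the configuration dict you would attach to the bucket (the Lambda ARN here is a placeholder; this is a sketch of the shape S3's notification configuration expects, not account-specific values):

```python
def build_notification_config(lambda_arn, prefix="DATA/"):
    """Notification configuration invoking a Lambda for new objects under prefix.

    lambda_arn is a placeholder ARN supplied by the caller.
    """
    return {
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": lambda_arn,
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": prefix},
            ]}},
        }]
    }
```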

Ultimately, explicitly rebuilding the partitions with a job scheduler after each run of your Spark job has the advantage of being self-documenting. On the other hand, AWS Lambda is convenient for jobs like this one.
