How to Trigger Glue ETL Pyspark job through S3 Events or AWS Lambda?
Question
I'm planning to write certain jobs in AWS Glue ETL using Pyspark, which I want to have triggered whenever a new file is dropped in an AWS S3 location, just like we trigger AWS Lambda functions using S3 events.
But I see only very limited options for triggering a Glue ETL script. Any help on this would be highly appreciated.
Accepted Answer
The following should work to trigger a Glue job from AWS Lambda. Configure the Lambda to be invoked by S3 events from the appropriate bucket, and assign IAM roles/permissions to the Lambda so that it can start the AWS Glue job on the user's behalf.
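As a minimal sketch of the permissions involved, the Lambda execution role needs to be allowed to start and poll Glue job runs. The policy statement below is an illustrative assumption, not the exact policy from the answer, and the job name in the resource ARN is a placeholder:

```python
import json

# Hypothetical minimal IAM policy for the Lambda execution role.
# The resource ARN is a placeholder; scope it to your actual Glue job.
glue_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["glue:StartJobRun", "glue:GetJobRun"],
            "Resource": "arn:aws:glue:*:*:job/YOUR-GLUE-JOB-NAME",
        }
    ],
}

# Render the policy as JSON, ready to paste into an IAM role definition.
print(json.dumps(glue_policy, indent=2))
```

Attach this policy (plus the usual CloudWatch Logs permissions) to the role the Lambda function runs under.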
import boto3

print('Loading function')

def lambda_handler(event, context):
    # Bucket that raised the S3 event
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    glue = boto3.client('glue')
    gluejobname = "YOUR GLUE JOB NAME"

    try:
        # Start the Glue job, then look up the initial state of that run
        runId = glue.start_job_run(JobName=gluejobname)
        status = glue.get_job_run(JobName=gluejobname, RunId=runId['JobRunId'])
        print("Job Status : ", status['JobRun']['JobRunState'])
    except Exception as e:
        print(e)
        print('Error starting Glue job {} for bucket {}. Make sure the job '
              'exists and this function has permission to start '
              'it.'.format(gluejobname, source_bucket))
        raise e
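For reference, the event the handler receives from S3 looks roughly like the trimmed sample below (the bucket and key names are made up). This shows the path the handler walks to pull out the source bucket:

```python
# Trimmed sketch of an S3 put-event payload; bucket/key names are made up.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-input-bucket"},
                "object": {"key": "incoming/data.csv"},
            }
        }
    ]
}

# Same lookup the handler performs, plus the object key for completeness.
record = sample_event["Records"][0]["s3"]
source_bucket = record["bucket"]["name"]
source_key = record["object"]["key"]
print(source_bucket, source_key)  # my-input-bucket incoming/data.csv
```

If the Glue job needs to know which file arrived, pass the bucket and key along via the `Arguments` parameter of `start_job_run`.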