Best Option(s) for Running Python ETL Code in AWS


Problem Description

I am looking for a recommendation on which AWS service (or combination thereof) to use to execute an ETL code in Python to transform text-based files:

Description of the code/process:

1. The Python code transforms input text files from a custom vendor format into CSV format.
2. A single invocation transforms a single file and can run anywhere from a minute to 10 minutes or more, as the sizes of the input files vary (from KBs to hundreds of MBs).
3. The code needs to run event-driven, as soon as a new input file is ready, which can happen at any time, multiple times a day.
4. I need to use AWS serverless options, hence no EC2.
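As an illustration only (the vendor format is not specified in the question; here I assume a simple pipe-delimited layout), the transform in step 1 might look like:

```python
import csv

def convert_vendor_file(input_path, output_path, delimiter='|'):
    """Convert a (hypothetical) delimiter-separated vendor file to CSV."""
    with open(input_path, 'r') as src, open(output_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for line in src:
            # Split each vendor record on the assumed delimiter and
            # let the csv module handle quoting/escaping on output.
            fields = line.rstrip('\n').split(delimiter)
            writer.writerow(fields)
```

The real parser would of course depend on the actual vendor format; the point is that the per-file logic is small enough to run in either Lambda or a Glue Python Shell job.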

My current solution is to use a Lambda/S3 event to detect the creation of new files in S3, "move" them to the appropriate folder in the same S3 bucket, and trigger an AWS Glue Python Shell job to transform them. I believe AWS EMR is overkill for the sizes of the files being transformed (…).

However, I am open to better recommendations, as AWS Glue so far does not appear as robust and mature as other services (like Lambda). If my current solution appears sound, please chime in anyway; that will help me make sure I am on the right path!

Thanks, Michael :)

Recommended Answer

We can configure a Lambda S3 event trigger on the landing folder, and when a file is uploaded, a brief script in Lambda can start the Glue job. The Glue Python script should contain the logic required to convert the input text files into CSV files. This way your job can run any number of times as files are uploaded to S3.

You are also billed only for the duration of the job run. Be aware that Glue's cost is somewhat higher because it is a managed service.

Create the event trigger so it invokes the Glue job. Here is a code snippet for AWS Lambda:

from __future__ import print_function
import boto3

glue = boto3.client('glue')

def lambda_handler(event, context):
    gluejobname = "<< THE GLUE JOB NAME >>"

    # Bucket and key of the object that triggered this invocation
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    source_key = event['Records'][0]['s3']['object']['key']

    try:
        # Start the Glue job and report its initial state
        runId = glue.start_job_run(JobName=gluejobname)
        status = glue.get_job_run(JobName=gluejobname, RunId=runId['JobRunId'])
        print("Job Status : ", status['JobRun']['JobRunState'])
    except Exception as e:
        print(e)
        print('Error starting job for object {} from bucket {}. Make sure '
              'they exist and your bucket is in the same region as this '
              'function.'.format(source_key, source_bucket))
        raise e
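One gap in the snippet above is that the Glue job is not told which file triggered it. `start_job_run` accepts an `Arguments` dict of job parameters (keys prefixed with `--`); a small helper, with hypothetical argument names, could build it:

```python
def glue_job_arguments(bucket, key):
    """Build the Arguments dict for glue.start_job_run so the Glue script
    knows which object to transform. The argument names are assumptions;
    the Glue script would read them back via awsglue.utils.getResolvedOptions."""
    return {'--s3_bucket': bucket, '--s3_key': key}
```

Inside `lambda_handler` you would then call `glue.start_job_run(JobName=gluejobname, Arguments=glue_job_arguments(source_bucket, source_key))`, with `source_bucket` and `source_key` taken from `event['Records'][0]['s3']`.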

