Automate bulk loading of data from s3 to Aurora MySQL RDS instance


Problem Description

I am relatively new to AWS, so I am not sure how to go about doing this.

I have CSV files on S3, and I have already set up an Aurora instance on RDS. What I cannot figure out is how to automate the bulk loading of the data, essentially doing something like LOAD DATA FROM S3 using a service such as AWS Glue.

I also tried the Glue-native path from S3 to RDS, but that essentially amounts to a series of inserts into RDS over a JDBC connection, which is also very slow for large datasets.

I can do it by independently running the command on RDS, but I do not want to do that; I want to leverage Glue. I also looked at using a MySQL connector for Python, but Glue natively supports only Python 2.7, which is something I do not want to use.

Any help would be greatly appreciated.

Recommended Answer

The approach is as stated above: have an S3 event trigger and a Lambda job listening on the S3 bucket/object location. As soon as a file is uploaded to the S3 location, the Lambda job runs, and from the Lambda you can configure a call to an AWS Glue job. This is exactly what we have done, and it has gone live successfully. Lambda has a 15-minute execution limit, and triggering/starting a Glue job should take less than a minute.

Please find a sample source below for reference.

from __future__ import print_function
import boto3
import urllib.parse

print('Loading function')

s3 = boto3.client('s3')
glue = boto3.client('glue')

def lambda_handler(event, context):
    gluejobname = "your-glue-job-name here"

    # The S3 event record carries the bucket and object key that fired the trigger
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    source_key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

    try:
        # Start the Glue job and report its initial run state
        runId = glue.start_job_run(JobName=gluejobname)
        status = glue.get_job_run(JobName=gluejobname, RunId=runId['JobRunId'])
        print("Job Status : ", status['JobRun']['JobRunState'])
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist '
              'and your bucket is in the same region as this '
              'function.'.format(source_key, source_bucket))
        raise e

To create the Lambda function, go to AWS Lambda -> Create a new function from scratch -> select S3 as the event source, and then configure the S3 bucket location and the prefixes as required. Then copy and paste the code sample above into the inline code area, and configure the Glue job name as needed. Please ensure you have all the required IAM roles/access set up.
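If you prefer to wire up the trigger programmatically rather than through the console, the sketch below shows one possible way using boto3. The bucket name, function ARN, prefix, and suffix are hypothetical placeholders, not values from the original setup.

import boto3

lambda_client = boto3.client('lambda')
s3 = boto3.client('s3')

# Hypothetical placeholders -- substitute your own bucket and function ARN
bucket = 'my-csv-bucket'
function_arn = 'arn:aws:lambda:us-east-1:123456789012:function:start-glue-load'

# Grant S3 permission to invoke the Lambda function
lambda_client.add_permission(
    FunctionName=function_arn,
    StatementId='s3-invoke-start-glue-load',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn='arn:aws:s3:::' + bucket,
)

# Invoke the function whenever a .csv object lands under the chosen prefix
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': function_arn,
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [
                {'Name': 'prefix', 'Value': 'incoming/'},
                {'Name': 'suffix', 'Value': '.csv'},
            ]}},
        }]
    },
)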

The Glue job should have provision to connect to your Aurora instance, and then you can use the "LOAD DATA FROM S3 ..." command provided by Aurora. Make sure all parameter group settings/configurations are done as needed.
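To illustrate that last point: for Aurora MySQL, loading from S3 generally requires an IAM role that is attached to the cluster and referenced in the cluster parameter group. Below is a minimal boto3 sketch of one way to do this, assuming hypothetical cluster, parameter group, and role names; the exact parameter name can vary by Aurora version.

import boto3

rds = boto3.client('rds')

# Hypothetical placeholders -- use your own cluster, parameter group, and role
cluster_id = 'my-aurora-cluster'
param_group = 'my-aurora-cluster-params'
role_arn = 'arn:aws:iam::123456789012:role/aurora-s3-read'

# Associate the S3-read role with the Aurora cluster
rds.add_role_to_db_cluster(DBClusterIdentifier=cluster_id, RoleArn=role_arn)

# Point the cluster at that role for LOAD DATA FROM S3 statements
rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName=param_group,
    Parameters=[{
        'ParameterName': 'aws_default_s3_role',
        'ParameterValue': role_arn,
        'ApplyMethod': 'immediate',
    }],
)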

Let me know if there are any issues.

UPDATE: Sample code snippet for loading from S3:

import mysql.connector

# url, uname, pwd, and dbase hold the Aurora endpoint and credentials
conn = mysql.connector.connect(host=url, user=uname, password=pwd, database=dbase)
cur = conn.cursor()

# Recreate the staging table, then bulk-load every object under the S3 prefix
createStgTable1 = "DROP TABLE IF EXISTS mydb.STG_TABLE;"
createStgTable2 = "CREATE TABLE mydb.STG_TABLE(COL1 VARCHAR(50) NOT NULL, COL2 VARCHAR(50), COL3 VARCHAR(50), COL4 CHAR(1) NOT NULL);"
loadQry = "LOAD DATA FROM S3 PREFIX 's3://<bucketname>/folder' REPLACE INTO TABLE mydb.STG_TABLE FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' IGNORE 1 LINES (@var1, @var2, @var3, @var4) SET col1= @var1, col2= @var2, col3= @var3, col4=@var4;"
cur.execute(createStgTable1)
cur.execute(createStgTable2)
cur.execute(loadQry)
conn.commit()
conn.close()

