How to run arbitrary / DDL SQL statements or stored procedures using AWS Glue


Question

Is it possible to execute arbitrary SQL commands like ALTER TABLE from an AWS Glue Python job? I know I can use it to read data from tables, but is there a way to execute other database-specific commands?

I need to ingest data into a target database and then run some ALTER commands right after.

Answer

So after doing extensive research and also opening a case with AWS support, I was told it is not possible from a Python shell or Glue PySpark job at this moment. But I tried something creative and it worked! The idea is to use the py4j bridge that Spark already relies on and utilize the standard java.sql package.

Two huge benefits of this approach:

  1. You can define your database connection as a Glue data connection and keep the JDBC details and credentials there without hardcoding them in the Glue code. My example below does that by calling glueContext.extract_jdbc_conf('your_glue_data_connection_name') to get the JDBC URL and credentials defined in Glue.

  2. If you need to run SQL commands on a database that Glue supports out of the box, you don't even need to supply a JDBC driver for that database - just make sure you set up a Glue connection for it and add that connection to your Glue job, and Glue will load the proper database driver jars.

Remember that the code below is executed by the driver process and cannot be executed by Spark workers/executors.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

logger = glueContext.get_logger()

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Resolve the JDBC URL and credentials from the Glue data connection
source_jdbc_conf = glueContext.extract_jdbc_conf('your_glue_database_connection_name')

from py4j.java_gateway import java_import
java_import(sc._gateway.jvm,"java.sql.Connection")
java_import(sc._gateway.jvm,"java.sql.DatabaseMetaData")
java_import(sc._gateway.jvm,"java.sql.DriverManager")
java_import(sc._gateway.jvm,"java.sql.SQLException")

conn = sc._gateway.jvm.DriverManager.getConnection(source_jdbc_conf.get('url'), source_jdbc_conf.get('user'), source_jdbc_conf.get('password'))

print(conn.getMetaData().getDatabaseProductName())

# Call a stored procedure; in this case dbo.sp_start_job (SQL Server)
cstmt = conn.prepareCall("{call dbo.sp_start_job(?)}")
cstmt.setString("job_name", "testjob")  # bind the parameter by name
results = cstmt.execute()
cstmt.close()

conn.close()

job.commit()
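The original question asked about ALTER TABLE rather than stored procedures. With the same java.sql connection, arbitrary DDL can be sent through conn.createStatement().execute(...) instead of prepareCall. The ingest-then-ALTER sequence looks the same through Python's DB-API; the sketch below uses sqlite3 as a stand-in (the table and column names are made up for illustration) so it runs without a Glue environment:

```python
import sqlite3

# Stand-in for the target database; in the Glue job you would issue the
# same DDL through the py4j java.sql connection instead.
conn = sqlite3.connect(":memory:")

# 1. Ingest data into the target table (hypothetical schema).
conn.execute("CREATE TABLE staging_orders (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO staging_orders VALUES (?, ?)",
    [(1, 9.99), (2, 24.50)],
)

# 2. Run an ALTER command right after the load.
conn.execute("ALTER TABLE staging_orders ADD COLUMN loaded_at TEXT")

columns = [row[1] for row in conn.execute("PRAGMA table_info(staging_orders)")]
print(columns)  # ['order_id', 'amount', 'loaded_at']
conn.close()
```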

