How to run arbitrary / DDL SQL statements or stored procedures using AWS Glue


Question

Is it possible to execute arbitrary SQL commands like ALTER TABLE from an AWS Glue Python job? I know I can use it to read data from tables, but is there a way to execute other database-specific commands?

I need to ingest data into a target database and then run some ALTER commands right after.

Answer

So after doing extensive research and also opening a case with AWS Support, I was told it is not possible from a Python shell or Glue PySpark job at this moment. But I tried something creative and it worked! The idea is to use py4j, which Spark already relies on, and the standard java.sql package.

Two major benefits of this approach:

  1. You can define your database connection as a Glue data connection and keep the JDBC details and credentials there, without hardcoding them in the Glue code. My example below does that by calling glueContext.extract_jdbc_conf('your_glue_data_connection_name') to get the JDBC url and credentials defined in Glue.

  2. If you need to run SQL commands on a database Glue supports out of the box, you don't even need to use or pass a JDBC driver for it - just make sure you set up a Glue connection for that database and add that connection to your Glue job, and Glue will upload the proper database driver jars.

Keep in mind that the code below is executed by the driver process and cannot be distributed to Spark workers/executors.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

logger = glueContext.get_logger()

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# read the JDBC url and credentials from the Glue data connection
source_jdbc_conf = glueContext.extract_jdbc_conf('your_glue_database_connection_name')

from py4j.java_gateway import java_import
java_import(sc._gateway.jvm,"java.sql.Connection")
java_import(sc._gateway.jvm,"java.sql.DatabaseMetaData")
java_import(sc._gateway.jvm,"java.sql.DriverManager")
java_import(sc._gateway.jvm,"java.sql.SQLException")

conn = sc._gateway.jvm.DriverManager.getConnection(source_jdbc_conf.get('url'), source_jdbc_conf.get('user'), source_jdbc_conf.get('password'))

print(conn.getMetaData().getDatabaseProductName())

# call a stored procedure - in this case SQL Server Agent's sp_start_job
cstmt = conn.prepareCall("{call dbo.sp_start_job(?)}")
cstmt.setString("job_name", "testjob")  # bind by parameter name (supported by the SQL Server JDBC driver)
results = cstmt.execute()

conn.close()
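The same connection object can run plain DDL through conn.createStatement().execute("ALTER TABLE ..."). Since the Glue-specific code above cannot run outside a Glue job, here is a self-contained sketch of the "ingest, then ALTER" workflow the question asks about, using Python's built-in sqlite3 module as a stand-in for the JDBC connection (table and column names are made up for illustration):

import sqlite3

# In-memory database standing in for the target database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Ingest some data into the target table.
cur.execute("CREATE TABLE target (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO target VALUES (?, ?)", [(1, "a"), (2, "b")])

# Run an arbitrary DDL statement right after the load.
cur.execute("ALTER TABLE target ADD COLUMN load_ts TEXT")

# Verify the new column exists.
cols = [row[1] for row in cur.execute("PRAGMA table_info(target)")]
print(cols)  # -> ['id', 'name', 'load_ts']
conn.close()

In the Glue job itself, the equivalent call would go through the java.sql Statement obtained from the py4j connection rather than a DB-API cursor, but the pattern - get a raw connection, execute the DDL string, close - is the same.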

