How to look for updated rows when using AWS Glue?


Question

I'm trying to use Glue for ETL on data I'm moving from RDS to Redshift.

As far as I am aware, Glue bookmarks only look for new rows using the specified primary key and do not track updated rows.

However, the data I am working with tends to have rows updated frequently, and I am looking for a possible solution. I'm a bit new to PySpark, so if it is possible to do this in PySpark I'd appreciate some guidance or a pointer in the right direction. If there's a possible solution outside of Spark, I'd love to hear it as well.

Answer

You can find the updated records by filtering the data at the source JDBC database with a pushdown query, as shown in the example below. I pass the date as an argument so that on each run I fetch only the latest values from the MySQL database.

query = ("(select ab.id, ab.name, ab.date1, bb.tStartDate "
         "from test.test12 ab join test.test34 bb on ab.id = bb.id "
         "where ab.date1 > '" + args['start_date'] + "') as testresult")

datasource0 = (spark.read.format("jdbc")
               .option("url", "jdbc:mysql://host.test.us-east-2.rds.amazonaws.com:3306/test")
               .option("driver", "com.mysql.jdbc.Driver")
               .option("dbtable", query)
               .option("user", "test")
               .option("password", "Password1234")
               .load())
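To make the idea above concrete, here is a minimal, self-contained sketch of how the pushdown query could be built from a date parameter. In a real Glue job the date would typically come in as a job argument (e.g. via awsglue.utils.getResolvedOptions); the table and column names below are the hypothetical ones from the example, not a real schema:

```python
def build_pushdown_query(start_date):
    """Build a JDBC pushdown query returning only rows updated after start_date.

    Wrapping the SELECT in parentheses with an alias lets Spark treat it as a
    table for the "dbtable" option, so the filtering happens on the MySQL side
    instead of pulling the whole table into Spark.
    """
    return (
        "(select ab.id, ab.name, ab.date1, bb.tStartDate "
        "from test.test12 ab "
        "join test.test34 bb on ab.id = bb.id "
        "where ab.date1 > '{}') as testresult".format(start_date)
    )

query = build_pushdown_query("2020-01-01")
print(query)
```

Note that this builds the SQL by string concatenation, as in the original answer; since the value comes from a job argument rather than user input that is usually acceptable, but you may still want to validate that the argument parses as a date before interpolating it.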

