How to look for updated rows when using AWS Glue?


Question


I'm trying to use Glue for ETL on data I'm moving from RDS to Redshift.


As far as I am aware, Glue bookmarks only look for new rows using the specified primary key and do not track updated rows.


However, the data I am working with tends to have rows updated frequently, and I am looking for a possible solution. I'm a bit new to pyspark, so if it is possible to do this in pyspark I'd highly appreciate some guidance or a pointer in the right direction. If there's a possible solution outside of Spark, I'd love to hear it as well.

Answer


You can find the updated records by filtering the data at the source JDBC database with a query, as shown in the example below. I have passed a date as an argument so that on each run I fetch only the latest values from the MySQL database.

query = "(select ab.id, ab.name, ab.date1, bb.tStartDate from test.test12 ab join test.test34 bb on ab.id = bb.id where ab.date1 > '" + args['start_date'] + "') as testresult"

datasource0 = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://host.test.us-east-2.rds.amazonaws.com:3306/test") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", query) \
    .option("user", "test") \
    .option("password", "Password1234") \
    .load()
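In a Glue job, `args['start_date']` would typically be supplied as a job parameter (in practice via `awsglue.utils.getResolvedOptions`). A minimal sketch of building the pushdown subquery as a function of that date, using the table and column names from the example (the helper name `build_update_query` is hypothetical, not part of any API):

```python
def build_update_query(start_date):
    """Build a derived-table subquery that selects only rows updated after start_date.

    Spark's JDBC reader accepts a parenthesized subquery with an alias
    in place of a table name for the "dbtable" option, so the filter
    runs on the source database rather than after loading all rows.
    """
    return (
        "(select ab.id, ab.name, ab.date1, bb.tStartDate "
        "from test.test12 ab join test.test34 bb on ab.id = bb.id "
        "where ab.date1 > '" + start_date + "') as testresult"
    )

query = build_update_query("2023-01-01")
```

On each scheduled run you would pass the timestamp of the previous run as `start_date`, so only rows whose `date1` changed since then are pulled from MySQL.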

