How to look for updated rows when using AWS Glue?

Question
I'm trying to use Glue for ETL on data I'm moving from RDS to Redshift.
As far as I am aware, Glue bookmarks only look for new rows using the specified primary key and do not track updated rows.
However, the data I am working with tends to have rows updated frequently, so I am looking for a possible solution. I'm a bit new to PySpark, so if it is possible to do this in PySpark I'd highly appreciate some guidance or a pointer in the right direction. If there's a possible solution outside of Spark, I'd love to hear it as well.
Answer

You can use a query to find the updated records by filtering the data at the source JDBC database, as shown in the example below. I pass a date as an argument so that each run fetches only the latest values from the MySQL database.
# Pushdown query: only rows with date1 after the supplied start date
query = ("(select ab.id, ab.name, ab.date1, bb.tStartDate "
         "from test.test12 ab join test.test34 bb on ab.id = bb.id "
         "where ab.date1 > '" + args['start_date'] + "') as testresult")

datasource0 = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://host.test.us-east-2.rds.amazonaws.com:3306/test")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", query)
    .option("user", "test")
    .option("password", "Password1234")
    .load())
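To make the date dynamic per run, you can pass it in as a Glue job parameter. Here is a minimal sketch of building the same pushdown query as a plain function; the table and column names come from the example above, and the job-argument lines are commented because they only work inside the Glue runtime (`getResolvedOptions` is Glue's standard helper for reading job arguments):

```python
def build_incremental_query(start_date: str) -> str:
    """Build a JDBC pushdown query that returns only rows whose
    date1 is after start_date (same tables/columns as above)."""
    return ("(select ab.id, ab.name, ab.date1, bb.tStartDate "
            "from test.test12 ab join test.test34 bb on ab.id = bb.id "
            "where ab.date1 > '" + start_date + "') as testresult")

# Inside a Glue job, the date would come from the job arguments, e.g.:
#   import sys
#   from awsglue.utils import getResolvedOptions
#   args = getResolvedOptions(sys.argv, ['start_date'])
#   query = build_incremental_query(args['start_date'])
query = build_incremental_query("2021-01-01")
```

The resulting string can be passed straight to the `dbtable` option of the JDBC reader, as in the snippet above; MySQL then does the filtering, so only the changed rows travel over the wire.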