How to look for updated rows when using AWS Glue?


Question


I'm trying to use Glue for ETL on data I'm moving from RDS to Redshift.


As far as I am aware, Glue bookmarks only look for new rows using the specified primary key and do not track updated rows.


However, the data I am working with tends to have rows updated frequently, and I am looking for a possible solution. I'm a bit new to pyspark, so if it is possible to do this in pyspark I'd highly appreciate some guidance or a pointer in the right direction. If there's a possible solution outside of Spark, I'd love to hear it as well.

Answer


You can find the updated records by filtering the data at the source JDBC database with a query, as shown in the example below. I have passed a date as an argument so that on each run I fetch only the latest values from the MySQL database.

query = "(select ab.id, ab.name, ab.date1, bb.tStartDate from test.test12 ab join test.test34 bb on ab.id = bb.id where ab.date1 > '" + args['start_date'] + "') as testresult"

datasource0 = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://host.test.us-east-2.rds.amazonaws.com:3306/test") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", query) \
    .option("user", "test") \
    .option("password", "Password1234") \
    .load()
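In a Glue job, `args['start_date']` would typically be supplied as a job parameter (in practice via `awsglue.utils.getResolvedOptions`). A minimal sketch of building the pushdown subquery as a function of that date, using the table and column names from the example (the helper name `build_update_query` is hypothetical, not part of any API):

```python
def build_update_query(start_date):
    """Build a derived-table subquery that selects only rows updated after start_date.

    Spark's JDBC reader accepts a parenthesized subquery with an alias
    in place of a table name for the "dbtable" option, so the filter
    runs on the source database rather than after loading all rows.
    """
    return (
        "(select ab.id, ab.name, ab.date1, bb.tStartDate "
        "from test.test12 ab join test.test34 bb on ab.id = bb.id "
        "where ab.date1 > '" + start_date + "') as testresult"
    )

query = build_update_query("2023-01-01")
```

On each scheduled run you would pass the timestamp of the previous run as `start_date`, so only rows whose `date1` changed since then are pulled from MySQL.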

