AWS Glue Bookmark produces duplicates

Views: 175 Published: 2020/8/24 0:20:30 Tags: amazon-web-services apache-spark parquet aws-glue


Problem description

I am submitting a Python script (actually PySpark) to a Glue job to process Parquet files and extract some analytics from this data source.

These Parquet files live in an S3 folder, and the folder keeps growing as new data arrives. I was happy with the bookmarking logic AWS Glue provides because it helps a lot: basically, it lets us process only new data without reprocessing data that has already been processed.
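For illustration, the behaviour expected from bookmarking can be sketched in plain Python. This is not how AWS Glue implements bookmarks and it does not use the Glue API; `BookmarkState`, `select_new_keys`, `run_job`, and the file names are hypothetical, chosen only to show "process only what earlier runs have not seen".

```python
from dataclasses import dataclass, field


@dataclass
class BookmarkState:
    """Hypothetical stand-in for the state a bookmark keeps between runs."""
    processed_keys: set = field(default_factory=set)


def select_new_keys(all_keys, state):
    """Return, in sorted order, the keys that no previous run has processed."""
    return sorted(k for k in all_keys if k not in state.processed_keys)


def run_job(all_keys, state):
    """Process only new keys, then record them in the state."""
    new_keys = select_new_keys(all_keys, state)
    # A real job would read and analyse the Parquet files here.
    # The state is updated only after processing, mirroring a commit step.
    state.processed_keys.update(new_keys)
    return new_keys


# Assumed example data: the folder grows by one file between runs.
state = BookmarkState()
first_run = run_job(["data/part-0.parquet", "data/part-1.parquet"], state)
second_run = run_job(
    ["data/part-0.parquet", "data/part-1.parquet", "data/part-2.parquet"], state
)
third_run = run_job(
    ["data/part-0.parquet", "data/part-1.parquet", "data/part-2.parquet"], state
)
```

With this model, the first run processes both files, the second run processes only `data/part-2.parquet`, and the third run processes nothing. Duplicates would correspond to a run in which the state is lost or never recorded, so that previously seen files are selected again.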

Unfortunately, in this scenario I notice that duplicates are produced on every run, and it looks like AWS Glue bookmarking is not working at all. What is the reason for this unexpected behaviour?
