AWS Glue Bookmark produces duplicates

Question

I am submitting a Python script (PySpark, actually) to a Glue job to process Parquet files and extract some analytics from this data source.

These Parquet files live in an S3 folder that grows continuously as new data arrives. I was happy with the bookmarking logic provided by AWS Glue because it helps a lot: it lets us process only new data without reprocessing data that has already been processed.

Unfortunately, in this scenario I see duplicates produced on every run, and it looks like AWS Glue bookmarking is not working at all. What is the reason for this unexpected behaviour?

Recommended answer

From the AWS Glue documentation, at the time this answer was written:

The Apache Parquet and ORC formats are currently not supported.

Update

Since July 26, 2019, AWS Glue job bookmarks also support the Parquet and ORC formats.

https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
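For anyone checking their own job now that Parquet is supported, here is a minimal sketch of a bookmark-aware Glue script. It is not the asker's code: the function name `run_job`, the `transformation_ctx` labels, and the `s3://example-bucket/...` paths are hypothetical placeholders. As far as I know, bookmarks only take effect when the job is started with `--job-bookmark-option` set to `job-bookmark-enable`, the source read carries a `transformation_ctx`, and the script calls `job.init(...)` and `job.commit()`.

```python
import sys

# Job parameter that enables bookmarks. It is set on the job itself
# (console, CLI, or API), not inside the script.
BOOKMARK_ARGS = {"--job-bookmark-option": "job-bookmark-enable"}


def run_job(argv, source_path, target_path):
    """Read Parquet from S3 with bookmarks and write it back out (sketch)."""
    # awsglue only exists inside the Glue runtime, so import it lazily.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())

    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)  # picks up bookmark state from earlier runs

    # transformation_ctx is the key under which bookmark state is tracked
    # for this source; the label itself is an arbitrary, hypothetical name.
    source = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [source_path]},
        format="parquet",
        transformation_ctx="source_parquet",
    )

    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": target_path},
        format="parquet",
        transformation_ctx="sink_parquet",
    )

    job.commit()  # saves the updated bookmark state for the next run


# Glue passes --JOB_NAME on the command line, so this only runs inside a job.
# Both paths are placeholders (assumptions), not taken from the question.
if "--JOB_NAME" in sys.argv:
    run_job(sys.argv, "s3://example-bucket/input/", "s3://example-bucket/output/")
```

If duplicates still show up with a script shaped like this, the first things I would check are whether the job was really started with bookmarks enabled and whether `job.commit()` is reached on every successful run.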
