AWS Glue Job Bookmarking

Question

I wanted to see whether there are more details about how job bookmarking is done in AWS Glue. The AWS docs don't provide much on this. I know that some basic functionality is available:

  • Enable
  • Disable
  • Pause
  • Reset

It seems that the bookmarking happens together with this call:

job.commit()

Can I access the bookmark? Can it be modified to reprocess some portion of the source?

Recommended answer

Some additional information:

The basic tactic behind the job bookmark design is to save the START time of the last completed job. When the job is re-run, it processes only the files whose modification timestamp is newer than the START time of the previous job, as bookmarked under the Transformation Context parameter.
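The start-time tactic can be sketched in plain Python. This is only an illustration of the idea described above, not Glue's internal implementation; `SourceFile` and `select_new_files` are hypothetical names made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourceFile:
    key: str             # object key, e.g. an S3 path (hypothetical)
    modified: datetime   # last-modified timestamp of the object


def select_new_files(files, bookmarked_start):
    """Return the files modified after the previous run's START time.

    With no bookmark yet (first run), every file is selected.
    """
    if bookmarked_start is None:
        return list(files)
    return [f for f in files if f.modified > bookmarked_start]


files = [
    SourceFile("data/a.csv", datetime(2019, 1, 1, 9, 0, tzinfo=timezone.utc)),
    SourceFile("data/b.csv", datetime(2019, 1, 1, 11, 0, tzinfo=timezone.utc)),
]
# START time of the previous completed run, as stored in the bookmark.
previous_start = datetime(2019, 1, 1, 10, 0, tzinfo=timezone.utc)

selected = select_new_files(files, previous_start)
print([f.key for f in selected])  # ['data/b.csv']
```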

However, the issue with this design is that under some conditions certain files would be incorrectly categorized as processed. For example, suppose a file is written to S3 with a timestamp just before the job starts, but because of the slight S3 consistency delay it is not yet visible to the job at that point. The file is not processed in that run, the bookmark gets updated when the job completes, and the next run skips the file because its earlier timestamp makes it look as if it had already been processed.

The Bookmarks feature is therefore designed to save not only the start time of the previous job, but also a list of files in a certain band of uncertainty around that timestamp. This list covers a threshold number of files within a time range before the timestamp. The next run then processes every file after that timestamp, plus any files in the band of uncertainty that have not yet been processed.
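A rough sketch of the band-of-uncertainty idea follows. It assumes the band is a fixed time window before the bookmarked start time and that the bookmark stores the keys already processed inside that window; the real feature is described as using a threshold number of files, and its exact rules are not documented here. `select_with_band` and the bookmark layout are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def select_with_band(files, bookmark, band):
    """Pick the files to process, given a bookmark with an uncertainty band.

    files:    dict mapping object key -> last-modified datetime
    bookmark: dict with "start" (datetime) and "seen" (set of keys already
              processed inside the band), or None on the first run
    band:     timedelta before "start" in which timestamps are not trusted
    """
    if bookmark is None:
        return sorted(files)
    band_begin = bookmark["start"] - band
    selected = []
    for key, modified in files.items():
        if modified > bookmark["start"]:
            selected.append(key)   # clearly newer than the last run
        elif modified >= band_begin and key not in bookmark["seen"]:
            selected.append(key)   # inside the band and not processed yet
    return sorted(selected)


def t(hour, minute):
    return datetime(2019, 1, 1, hour, minute, tzinfo=timezone.utc)


files = {
    "early.csv": t(9, 0),          # well before the band: treated as done
    "seen.csv": t(9, 57),          # in the band, processed last run
    "late_visible.csv": t(9, 59),  # in the band, was not visible last run
    "new.csv": t(10, 30),          # after the bookmarked start time
}
bookmark = {"start": t(10, 0), "seen": {"seen.csv"}}

result = select_with_band(files, bookmark, timedelta(minutes=5))
print(result)  # ['late_visible.csv', 'new.csv']
```

With a pure timestamp comparison, `late_visible.csv` would have been skipped; keeping the list of keys seen inside the band is what lets the next run pick it up.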

The Transformation Context (transformation_ctx) is the key under which the internal record of processed files is kept and updated. The job.init command creates or loads the bookmark, while job.commit commits the updated bookmark at the end of the run.
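On the question of accessing the bookmark from outside the job: the Glue API exposes GetJobBookmark and ResetJobBookmark, which boto3 surfaces as `get_job_bookmark` and `reset_job_bookmark`. The helper below is a minimal sketch; `show_and_reset_bookmark` is a hypothetical name, the job name is a placeholder, and error handling (for example, a job with no bookmark yet) is left out. Note that a reset clears the stored state, so the next run reprocesses the source from scratch rather than a chosen portion of it.

```python
def show_and_reset_bookmark(glue, job_name, reset=False):
    """Print the stored bookmark entry for a job and optionally reset it.

    `glue` is expected to be a boto3 Glue client, i.e. boto3.client("glue").
    """
    # Fetch the bookmark entry that job.commit() saved for this job.
    entry = glue.get_job_bookmark(JobName=job_name)["JobBookmarkEntry"]
    print("run:", entry.get("Run"), "attempt:", entry.get("Attempt"))
    print("bookmark state:", entry.get("JobBookmark"))

    if reset:
        # Clear the bookmark so the next run starts over.
        glue.reset_job_bookmark(JobName=job_name)
    return entry
```

It could be called as `show_and_reset_bookmark(boto3.client("glue"), "my-glue-job", reset=True)`, assuming boto3 is installed, credentials are configured, and a job with that placeholder name exists.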

Hope that helps.
