How to fail a (cron) job after a certain number of retries?
Question
We have a Kubernetes cluster of web-scraping cron jobs set up. Everything seems to go well until a cron job starts to fail (e.g., when a site's structure changes and our scraper no longer works). Every now and then, a few failing cron jobs keep retrying to the point that they bring down our cluster. Running kubectl get cronjobs (prior to a cluster failure) shows too many jobs running for a failing job.
I've tried following the note described here regarding a known issue with the pod backoff failure policy; however, it does not seem to work.
Here is our config for reference:
```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: scrape-al
spec:
  schedule: '*/15 * * * *'
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 0
  successfulJobsHistoryLimit: 0
  jobTemplate:
    metadata:
      labels:
        app: scrape
        scrape: al
    spec:
      template:
        spec:
          containers:
            - name: scrape-al
              image: 'govhawk/openstates:1.3.1-beta'
              command:
                - /opt/openstates/openstates/pupa-scrape.sh
              args:
                - al bills --scrape
          restartPolicy: Never
      backoffLimit: 3
```
Ideally, we would like a cron job to be terminated after N retries (e.g., something like kubectl delete cronjob my-cron-job after my-cron-job has failed 5 times). Any ideas or suggestions would be much appreciated. Thanks!
Answer
You can tell your Job to stop retrying by using backoffLimit.
Specifies the number of retries before marking this job failed.
In your case:
```yaml
spec:
  template:
    spec:
      containers:
        - name: scrape-al
          image: 'govhawk/openstates:1.3.1-beta'
          command:
            - /opt/openstates/openstates/pupa-scrape.sh
          args:
            - al bills --scrape
      restartPolicy: Never
  backoffLimit: 3
```
You set backoffLimit to 3 for your Job. This means that when the CronJob creates a Job, that Job will retry up to 3 times if it fails. The setting controls the Job, not the CronJob.
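To show what backoffLimit bounds, here is a minimal sketch of the retry timeline for a single Job. It assumes the back-off described in the Kubernetes Job documentation: failed Pods are recreated after 10 s, 20 s, 40 s, and so on, capped at six minutes. The function names are hypothetical, and real timings can drift slightly.

```python
# Sketch: approximate retry schedule of ONE Job created by the CronJob.
# Assumption: the Job controller's documented back-off (10 s, doubling
# after each failed Pod, capped at 6 minutes). Real timings may drift.
from typing import List

BASE_DELAY_S = 10      # first back-off delay
MAX_DELAY_S = 6 * 60   # back-off cap


def retry_delays(backoff_limit: int) -> List[int]:
    """Approximate wait (seconds) before each of the Job's retries."""
    delays = []
    delay = BASE_DELAY_S
    for _ in range(backoff_limit):
        delays.append(min(delay, MAX_DELAY_S))
        delay *= 2
    return delays


def max_pods_per_job(backoff_limit: int) -> int:
    """One initial Pod plus at most `backoff_limit` retries."""
    return backoff_limit + 1


if __name__ == "__main__":
    print(retry_delays(3))      # [10, 20, 40]
    print(max_pods_per_job(3))  # 4
```

After the last retry, the Job is marked Failed and stops. The CronJob itself is unaffected by this.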
When a Job fails, the CronJob still creates another Job at the next scheduled time.
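The following rough calculation shows why failing CronJobs pile up. With schedule '*/15 * * * *' and concurrencyPolicy: Allow, each tick starts a new Job no matter how the previous ones ended. The sketch makes several illustrative assumptions: every run fails, each Job exhausts its backoffLimit, and the number of CronJobs (n_cronjobs) is hypothetical. The helper names are also hypothetical.

```python
# Sketch: rough upper bound on Pods started by failing CronJobs.
# Assumptions (illustrative): every run fails, each Job uses its full
# backoffLimit, and concurrencyPolicy: Allow lets every tick start a Job.


def runs_per_hour(interval_minutes: int) -> int:
    """Schedule ticks per hour for a '*/N * * * *' cron expression."""
    return len(range(0, 60, interval_minutes))


def max_pods_per_hour(interval_minutes: int, backoff_limit: int,
                      n_cronjobs: int = 1) -> int:
    """Upper bound on Pods created per hour across `n_cronjobs` CronJobs."""
    pods_per_job = backoff_limit + 1  # initial attempt + retries
    return runs_per_hour(interval_minutes) * pods_per_job * n_cronjobs


if __name__ == "__main__":
    print(runs_per_hour(15))                        # 4
    print(max_pods_per_hour(15, 3))                 # 16
    print(max_pods_per_hour(15, 3, n_cronjobs=50))  # 800 (hypothetical 50 scrapers)
```

backoffLimit caps the count per Job, but nothing here caps the number of Jobs.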
What you want: if I understand correctly, you want to stop scheduling new Jobs once your scheduled Jobs have failed 5 times. Right?
Answer: In that case, the CronJob cannot do this for you automatically.
Possible solution: Suspend the CronJob so that it stops scheduling new Jobs:
```yaml
spec:
  suspend: true
```
You can do this manually. If you do not want to, you need to set up a watcher that monitors your CronJob's status and updates the CronJob to suspend it when necessary.
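Below is a minimal sketch of the decision logic such a watcher might use. The function names are hypothetical. Fetching Job statuses from the API server is left out, and outcomes are given as the Job condition types "Complete" and "Failed", ordered oldest to newest. The watcher could select the CronJob's Jobs by the labels in the jobTemplate (app: scrape, scrape: al). Note that failedJobsHistoryLimit: 0 in the config above deletes failed Jobs, so it would need to be raised (for example, to at least N) for the watcher to see them. The returned body is a merge patch that could be applied with kubectl patch cronjob scrape-al -p '{"spec":{"suspend":true}}' or an equivalent client call.

```python
# Sketch: decision logic for a hypothetical watcher that suspends a
# CronJob after N consecutive failed Jobs. Listing Jobs and applying the
# patch are out of scope; this only counts failures and builds the patch.
import json
from typing import Iterable, Optional


def consecutive_failures(outcomes: Iterable[str]) -> int:
    """Count trailing 'Failed' outcomes (outcomes ordered oldest -> newest)."""
    count = 0
    for outcome in reversed(list(outcomes)):
        if outcome != "Failed":
            break
        count += 1
    return count


def suspend_patch(outcomes: Iterable[str], max_failures: int = 5) -> Optional[str]:
    """Return a merge-patch body that suspends the CronJob, or None."""
    if consecutive_failures(outcomes) >= max_failures:
        return json.dumps({"spec": {"suspend": True}})
    return None


if __name__ == "__main__":
    history = ["Complete"] + ["Failed"] * 5
    print(suspend_patch(history))  # {"spec": {"suspend": true}}
```

Counting only trailing failures means that a single successful run resets the counter, which matches "failed N times in a row". To delete the CronJob instead, as the question suggests, the same check could trigger kubectl delete cronjob.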