Writing to BigQuery from Dataflow - JSON files are not deleted when a job finishes

Problem Description

One of our Dataflow jobs writes its output to BigQuery. My understanding of how this is implemented under the hood is that Dataflow actually writes the results (sharded) in JSON format to GCS, and then kicks off a BigQuery load job to import that data.
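For reference, here is a minimal sketch of what that load step looks like if run by hand with the bq CLI; the dataset, table, schema file and GCS path are placeholders, not values taken from our job:

# Import newline-delimited JSON shards staged in GCS into a BigQuery table,
# roughly what Dataflow's internal load job does after writing the temp files.
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  <dataset>.<table> \
  "gs://<staging-bucket>/<path>/*.json" \
  ./schema.json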

However, we've noticed that some JSON files are not deleted after the job finishes, regardless of whether it succeeds or fails. There is no warning or suggestion in the error message that the files will not be deleted. When we noticed this, we had a look at our bucket and it had hundreds of large JSON files from failed jobs (mostly during development).

I would have thought that Dataflow should handle any cleanup, even if the job fails, and when it succeeds those files should definitely be deleted. Leaving these files around after the job has finished incurs significant storage costs!
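A quick way to gauge that cost is to check the bucket's total size (the bucket name below is a placeholder):

# Summarised, human-readable size of the staging bucket.
gsutil du -sh gs://<bucket>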

Is this a bug?

成功"但在GCS中留下了数百个大文件的作业的示例作业ID: 2015-05-27_18_21_21-8377993823053896089

Example job id of a job that "succeeded" but left hundreds of large files in GCS: 2015-05-27_18_21_21-8377993823053896089

Answer

Because this is still happening, we decided to clean up after ourselves once the pipeline has finished executing. We run the following command to delete everything that is not a JAR or ZIP:

# Delete everything in the bucket that does not end in .zip or .jar.
gsutil ls -p <project_id> gs://<bucket> | grep -vE '\.(zip|jar)$' | xargs -n 1 gsutil -m rm -r
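An alternative sketch, not what the command above does, is to let GCS expire the leftovers automatically with an object lifecycle rule on the staging bucket. The 7-day age and the lifecycle.json file name are illustrative; note that a bucket-wide rule like this would also age out the staged JARs and ZIPs unless they live in a different bucket.

# Example: have GCS delete objects older than 7 days automatically.
cat > lifecycle.json <<'EOF'
{
  "lifecycle": {
    "rule": [
      {"action": {"type": "Delete"}, "condition": {"age": 7}}
    ]
  }
}
EOF
gsutil lifecycle set lifecycle.json gs://<bucket>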
