Staging files on google dataflow worker


Question

Is there anything in the Dataflow SDK that would allow me to stage resource files on a worker? I have specific static file resources that I need to make available on the file system for a custom DoFn that is performing NLP. My goal is to get a zip file resource from the classloader and unzip it on the worker file system only once as the worker is being initialized, rather than trying to do this in the custom DoFn.

Answer

You can specify --filesToStage to specify files that should be staged. There are several issues to be aware of:

  1. By default, the Dataflow SDK sets --filesToStage to all of the files in your classpath, which ensures that the code needed to run your pipeline is available to the worker. If you override this option, you'll need to make sure that it still includes your code (a launch-time sketch follows this list).
  2. The files on the worker (which will be in the classpath) will have an MD5 hash appended to them. So if you specified --filesToStage=foo.zip, the file name would be foo-<someHash>.zip. You would need to iterate over all the files in the classpath to find the appropriate one (see the worker-side sketch at the end of this answer).
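For the first issue, here is a minimal launch-time sketch, assuming the 1.x Java SDK (package com.google.cloud.dataflow.sdk) where setFilesToStage is available on DataflowPipelineOptions; the local path is illustrative. It rebuilds the staging list from the JVM classpath and appends the resource archive, rather than replacing the default list:

```java
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StageResourceExample {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation()
            .as(DataflowPipelineOptions.class);

    // Rebuild the default staging list from the JVM classpath so the
    // pipeline code itself is still staged, then append the archive.
    List<String> filesToStage = new ArrayList<>(Arrays.asList(
        System.getProperty("java.class.path").split(File.pathSeparator)));
    filesToStage.add("/local/path/to/foo.zip"); // illustrative path
    options.setFilesToStage(filesToStage);

    // ... construct and run the pipeline with these options ...
  }
}
```

Equivalently, the flag can be passed on the command line as a comma-separated list, e.g. --filesToStage=/path/to/code.jar,/path/to/foo.zip.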

See the documentation on --filesToStage at https://cloud.google.com/dataflow/pipelines/executing-your-pipeline for more information.
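For the second issue, and the asker's goal of unzipping only once per worker, here is a worker-side sketch, assuming the archive was staged as foo.zip (so it appears on the worker as foo-<someHash>.zip); the class name, the prefix match, and the temp-directory location are all illustrative:

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Locates the staged zip on the worker classpath and extracts it once. */
public final class StagedResources {
  private static Path extractedDir;

  /** Finds the staged "foo-<hash>.zip" and returns its extraction dir. */
  public static synchronized Path getResourceDir() throws IOException {
    if (extractedDir != null) {
      return extractedDir;
    }
    // Staged files land on the worker classpath with an MD5 hash appended
    // (e.g. foo-1a2b3c.zip), so match on prefix and suffix.
    File stagedZip = null;
    for (String entry
        : System.getProperty("java.class.path").split(File.pathSeparator)) {
      File f = new File(entry);
      if (f.getName().startsWith("foo-") && f.getName().endsWith(".zip")) {
        stagedZip = f;
        break;
      }
    }
    if (stagedZip == null) {
      throw new IOException("Staged zip foo-*.zip not found on classpath");
    }
    Path target = Files.createTempDirectory("nlp-resources");
    try (ZipFile zip = new ZipFile(stagedZip)) {
      Enumeration<? extends ZipEntry> entries = zip.entries();
      while (entries.hasMoreElements()) {
        ZipEntry e = entries.nextElement();
        Path out = target.resolve(e.getName()).normalize();
        if (!out.startsWith(target)) {
          continue; // guard against zip-slip entry names
        }
        if (e.isDirectory()) {
          Files.createDirectories(out);
        } else {
          Files.createDirectories(out.getParent());
          try (InputStream in = zip.getInputStream(e)) {
            Files.copy(in, out);
          }
        }
      }
    }
    extractedDir = target;
    return extractedDir;
  }
}
```

A DoFn can call StagedResources.getResourceDir() from startBundle (or lazily on first use); the synchronized guard ensures the archive is extracted at most once per worker JVM even when multiple threads process bundles concurrently.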

