Staging files on a Google Dataflow worker

This article covers how to stage files on a Google Dataflow worker. The question and answer below should be a useful reference for anyone facing the same problem.

Problem Description

Is there anything in the Dataflow SDK that would allow me to stage resource files on a worker? I have specific static file resources that I need to make available on the file system for a custom DoFn that is performing NLP. My goal is to get a zip file resource from the classloader and unzip it on the worker file system only once, as the worker is being initialized, rather than trying to do this in the custom DoFn.

Solution

You can use --filesToStage to specify the files that should be staged. There are several issues to be aware of:

1. By default, the Dataflow SDK sets --filesToStage to all of the files in your classpath, which ensures that the code needed to run your pipeline is available to the worker. If you override this option you'll need to make sure that it includes your code.
2. The files on the worker (which will be in the classpath) will have an MD5 hash appended to them. So if you specified --filesToStage=foo.zip, the file name would be foo-<someHash>.zip. You would need to iterate over all the files in the classpath to find the appropriate one; a sketch of this follows the list.
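
Concretely, the worker-side lookup in point 2 can be done with plain Java: scan the classpath entries for the renamed archive and unpack it once per JVM. The sketch below assumes the archive was staged as foo.zip (so it arrives on the worker as foo-<someHash>.zip); the StagedResources class and its getOrExtract method are illustrative names, not part of the SDK.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/**
 * Illustrative helper (not part of the Dataflow SDK): finds a staged zip
 * on the worker classpath and extracts it exactly once per JVM.
 */
public final class StagedResources {
  private static File extractedDir; // set once per worker JVM

  public static synchronized File getOrExtract(String baseName) throws IOException {
    if (extractedDir != null) {
      return extractedDir;
    }
    // Staged files are renamed with an MD5 hash, e.g. foo-<someHash>.zip,
    // so scan the classpath entries for one matching that pattern.
    File zip = null;
    for (String entry : System.getProperty("java.class.path").split(File.pathSeparator)) {
      File candidate = new File(entry);
      String name = candidate.getName();
      if (name.startsWith(baseName + "-") && name.endsWith(".zip")) {
        zip = candidate;
        break;
      }
    }
    if (zip == null) {
      throw new IOException("No staged zip for " + baseName + " found on the classpath");
    }
    // Unzip into a temp directory using only java.util.zip.
    File dir = Files.createTempDirectory(baseName).toFile();
    try (ZipFile zipFile = new ZipFile(zip)) {
      Enumeration<? extends ZipEntry> entries = zipFile.entries();
      while (entries.hasMoreElements()) {
        ZipEntry e = entries.nextElement();
        File out = new File(dir, e.getName());
        if (e.isDirectory()) {
          out.mkdirs();
          continue;
        }
        out.getParentFile().mkdirs();
        try (InputStream in = zipFile.getInputStream(e);
             FileOutputStream os = new FileOutputStream(out)) {
          byte[] buffer = new byte[8192];
          int read;
          while ((read = in.read(buffer)) != -1) {
            os.write(buffer, 0, read);
          }
        }
      }
    }
    extractedDir = dir;
    return dir;
  }

  private StagedResources() {}
}
```

A custom DoFn can then call StagedResources.getOrExtract("foo") from its processing code; the synchronized check ensures the extraction happens only once per worker JVM, which is the behavior the question asks for.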

See the documentation on --filesToStage in https://cloud.google.com/dataflow/pipelines/executing-your-pipeline for some more info.
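
For point 1, the launcher can re-add the classpath entries before appending the extra archive, so overriding the option does not drop the pipeline's own code. This is a minimal sketch assuming the Dataflow SDK 1.x package names and the setFilesToStage setter backing the --filesToStage flag; the path to foo.zip is a placeholder.

```java
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class LaunchWithStagedZip {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);

    // Overriding --filesToStage replaces the default list (point 1 above),
    // so re-add every classpath entry before appending the resource archive.
    List<String> toStage = new ArrayList<>();
    for (String entry : System.getProperty("java.class.path").split(File.pathSeparator)) {
      toStage.add(entry);
    }
    toStage.add("/path/to/foo.zip"); // hypothetical local path to the archive
    options.setFilesToStage(toStage);

    // ... build the Pipeline with these options and run it ...
  }
}
```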

This concludes this article on staging files on a Google Dataflow worker. We hope the answer above is helpful.
