Is it possible to run Cloud Dataflow with custom packages?


Question

Is it possible to provision Dataflow workers with custom packages? I'd like to shell out to a Debian-packaged binary from inside a computation.

显然,程序包配置非常复杂,仅在--filesToStage中捆绑文件是不可行的. 该解决方案应包括在某个时候安装Debian软件包.

To be clear, the package configuration is sufficiently complex that it's not feasible to just bundle the files in --filesToStage. The solution should involve installing the Debian package at some point.

Answer

This is not something Dataflow explicitly supports. However, below are some suggestions on how you could accomplish this. Please keep in mind that things could change in the service that could break this in the future.

There are two separate problems:

  1. Getting the debian package onto the workers.
  2. Installing the debian package.

For the first problem you can use --filesToStage and specify the path to your debian package. This will cause the package to be uploaded to GCS and then downloaded to the worker on startup. If you use this option, you must also include all of your jars in the value of --filesToStage, since they are no longer included by default once you set --filesToStage explicitly.
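For illustration, the launch command might look something like the sketch below. The jar name, main class, project, bucket, and package file name are all placeholders, the runner name assumes the pre-Beam Dataflow Java SDK, and I believe list-valued options such as --filesToStage take comma-separated values:

java -cp my-pipeline-bundled.jar com.example.MyPipeline \
    --runner=BlockingDataflowPipelineRunner \
    --project=my-project \
    --stagingLocation=gs://my-bucket/staging \
    --filesToStage=my-pipeline-bundled.jar,mytool.deb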

On the Java worker, any files passed via --filesToStage will be available in the following directories (or a subdirectory of them):

/var/opt/google/dataflow

/dataflow/packages

You would need to check both locations to be sure of finding the file.

There is no guarantee that these directories won't change in the future; these are simply the locations in use today.
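As a rough sketch of what that lookup could look like on the worker: the two directory paths come from the answer above, while the helper name and the decision to also walk subdirectories are my own assumptions.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.stream.Stream;

public class StagedFiles {
  // Locations the Java worker uses today; not guaranteed to stay stable.
  private static final String[] STAGING_DIRS = {
      "/var/opt/google/dataflow", "/dataflow/packages"};

  // Searches both staging directories (and their subdirectories) for a staged file.
  public static Optional<Path> findStagedFile(String fileName) throws IOException {
    for (String dir : STAGING_DIRS) {
      Path root = Paths.get(dir);
      if (!Files.isDirectory(root)) {
        continue;
      }
      try (Stream<Path> walk = Files.walk(root)) {
        Optional<Path> match = walk
            .filter(p -> p.getFileName().toString().equals(fileName))
            .findFirst();
        if (match.isPresent()) {
          return match;
        }
      }
    }
    return Optional.empty();
  }
}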

To solve the second problem you can override startBundle in your DoFn. From there you could shell out to the command line and install your debian package after finding it in /dataflow/packages.

There could be multiple instances of your DoFn running side by side, so you could get contention issues if two processes try to install the package simultaneously. I'm not sure whether the debian package system can handle this or whether you need to do so explicitly in your code.
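A minimal sketch of that idea, assuming the pre-Beam Dataflow Java SDK where DoFn exposes an overridable startBundle(Context) method: the package name, the StagedFiles helper from the earlier snippet, the use of sudo dpkg -i, and the worker user having the privileges to run it are all assumptions to verify. Note that the static guard only serializes installs within a single worker JVM, not across worker processes.

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import java.nio.file.Path;

public class InstallAndShellOutFn extends DoFn<String, String> {
  // Guards against concurrent installs by DoFn instances sharing one worker JVM.
  private static boolean installed = false;

  @Override
  public void startBundle(Context c) throws Exception {
    synchronized (InstallAndShellOutFn.class) {
      if (!installed) {
        // "mytool.deb" is a placeholder; locate it under the staging directories.
        Path pkg = StagedFiles.findStagedFile("mytool.deb")
            .orElseThrow(() -> new IllegalStateException("Staged package not found"));
        // Installing typically needs root; whether the worker user can sudo is an assumption.
        Process install = new ProcessBuilder("sudo", "dpkg", "-i", pkg.toString())
            .inheritIO()
            .start();
        if (install.waitFor() != 0) {
          throw new RuntimeException("dpkg -i failed with exit code " + install.exitValue());
        }
        installed = true;
      }
    }
  }

  @Override
  public void processElement(ProcessContext c) throws Exception {
    // Once the package is installed, shell out to the packaged binary here.
    c.output(c.element());
  }
}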

A slight variant of this approach is to not use --filesToStage to distribute the package to your workers but instead add code to your startBundle to fetch it from some location.
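For example, a hypothetical helper along these lines could be called from startBundle before the install step, e.g. PackageFetcher.fetchFromGcs("gs://my-bucket/packages/mytool.deb", "/tmp/mytool.deb"). Whether gsutil is present on the worker image is an assumption; using a GCS client library would be a more robust alternative.

import java.io.IOException;

public class PackageFetcher {
  // Copies the package from GCS to a local path by shelling out to gsutil.
  public static void fetchFromGcs(String gcsUri, String localPath)
      throws IOException, InterruptedException {
    Process fetch = new ProcessBuilder("gsutil", "cp", gcsUri, localPath)
        .inheritIO()
        .start();
    if (fetch.waitFor() != 0) {
      throw new RuntimeException("gsutil cp failed with exit code " + fetch.exitValue());
    }
  }
}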
