Can Google Dataflow use existing VMs instead of temporarily created ones?


Problem description

Same as the title: can Dataflow use existing VM instances rather than temporarily created ones?

Recommended answer

After asking the OP about the reason behind the request, which was then provided in a reply, I am going to offer the following as a potential answer:

The power behind Dataflow is to achieve a high degree of parallelism when processing data pipelines. The back-story of the original request was that "something" worked when run with the local runner but did not work as desired when using Dataflow as the runner. This then appears to have led the OP to think "we'll just run Dataflow using the local runner". In my opinion, that isn't a great idea. One uses the local runner for development and unit testing; it doesn't provide any form of horizontal scaling ... it literally runs on just one machine.
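The runner choice is normally expressed purely through pipeline options, so the same pipeline code can be developed locally and then submitted to the managed service. Here is a minimal sketch, assuming the Apache Beam Python SDK; the project, region, and bucket names are placeholders, not values from the original question:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(runner: str = "DirectRunner") -> None:
    # "DirectRunner" executes the pipeline on the local machine (development
    # and unit tests). "DataflowRunner" submits the same pipeline to the
    # managed Dataflow service, which provisions its own ephemeral workers.
    options = PipelineOptions(
        runner=runner,
        project="my-project",                # placeholder
        region="us-central1",                # placeholder
        temp_location="gs://my-bucket/tmp",  # placeholder; required by Dataflow
    )
    with beam.Pipeline(options=options) as p:
        (p
         | "Create" >> beam.Create(["a", "b", "c"])
         | "Upper" >> beam.Map(str.upper)
         | "Print" >> beam.Map(print))


if __name__ == "__main__":
    run()  # switch to run("DataflowRunner") to execute on Dataflow workers
```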

When one runs a pipeline job on distributed Dataflow, it creates as many workers as needed to sensibly distribute the job across many machines. If the job then wishes to produce a result as file output, the question becomes "Where will that data be written?". The answer can't be a local file relative to where the Dataflow job was launched because, by definition, the job runs across multiple machines and there is no notion of one machine being "the output". To solve this, data should be written to Google Cloud Storage, which is a common storage area visible to all machines. The related question posed by the OP describes a potential problem with writing data to GCS as opposed to a local file (as used with the local runner), but I believe that is the problem to solve (i.e. how to write to centralized GCS storage correctly) rather than trying to use a single VM. Dataflow provides zero control over the nature of the Dataflow processing engines (workers). They are logically ephemeral and are "just there" to process Dataflow work.
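Concretely, the sink should be a gs:// path rather than a local filename, so every worker writes shards to a location they can all reach. A minimal sketch, again assuming the Beam Python SDK, with placeholder project, region, and bucket names:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder options; on Dataflow the bucket must be readable and writable
# by the job's worker service account.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder
    region="us-central1",                # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")  # placeholder path
     # Write to GCS, not to a local path: every worker can see gs://,
     # whereas a local file would only exist on whichever VM wrote it.
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/result"))  # placeholder path
```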

