Executing a Dataflow job with multiple inputs/outputs using gcloud cli


Question

I've designed a data transformation in Dataprep and am now attempting to run it by using the template in Dataflow. My flow has several inputs and outputs - the Dataflow template provides them as a JSON object with key/value pairs for each input & location. They look like this (line breaks added for easy reading):

{
    "location1": "project:bq_dataset.bq_table1",
    #...
    "location10": "project:bq_dataset.bq_table10",
    "location17": "project:bq_dataset.bq_table17"
}

I have 17 inputs (mostly lookups) and 2 outputs (one csv, one bigquery). I'm passing these to the gcloud CLI like this:

gcloud dataflow jobs run job-201807301630 \
    --gcs-location=gs://bucketname/dataprep/dataprep_template \
    --parameters inputLocations={"location1":"project..."},outputLocations={"location1":"gs://bucketname/output.csv"}

But I'm getting an error:

ERROR: (gcloud.dataflow.jobs.run) unrecognized arguments:
inputLocations=location1:project:bq_dataset.bq_table1,outputLocations=location2:project:bq_dataset.bq_output1
inputLocations=location10:project:bq_dataset.bq_table10,outputLocations=location1:gs://bucketname/output.csv

From the error message, it looks to be merging the inputs and outputs so that as I have two outputs, each two inputs are paired with the two outputs:

input1:output1
input2:output2
input3:output1
input4:output2
input5:output1
input6:output2
...

I've tried quoting the input/output objects (single and double, plus removing the quotes in the object), wrapping them in [], and using tildes, but no joy. Has anyone managed to execute a Dataflow job with multiple inputs?

Answer

I finally found a solution for this via a huge process of trial and error. There are several steps involved.

The --parameters argument is a dictionary-type argument. There are details on these in a document you can read by typing gcloud topic escaping in the CLI, but in short it means you'll need an = between --parameters and the arguments, and then the format is key=value pairs with the value enclosed in quote marks ("):

--parameters=inputLocations="object",outputLocations="object"
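
For reference, the full set of escaping rules mentioned above can be pulled up in the CLI itself:

gcloud topic escaping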

Escape the objects

Then, the objects need the quotes escaping to avoid ending the value prematurely, so

{"location1":"gcs://bucket/whatever"...

becomes

{\"location1\":\"gcs://bucket/whatever\"...

Choose a different separator

Next, the CLI gets confused because while the key=value pairs are separated by a comma, the values also contain commas in the objects. So you can define a different separator by putting it between carets (^) at the start of the argument, then using it between the key=value pairs:

--parameters=^*^inputLocations="{\"location1\":\"...\"}"*outputLocations="{\"location1\":\"...\"}"

I used * because ; didn't work - presumably because the shell treats an unquoted ; as a command terminator.

Also note that the gcloud topic escaping info says:

In cmd.exe and PowerShell on Windows, ^ is a special character and you must escape it by repeating it. In the following examples, every time you see ^, replace it with ^^^^.
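
Following that note, here's a sketch of what the separator declaration above would look like on Windows. This just applies the doc's substitution rule mechanically and is an untested assumption - cmd.exe and PowerShell may also need different escaping for the inner quote marks:

--parameters=^^^^*^^^^inputLocations="{\"location1\":\"...\"}"*outputLocations="{\"location1\":\"...\"}"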

Don't forget customGcsTempLocation

After all that, I'd forgotten that customGcsTempLocation needs adding to the key=value pairs in the --parameters argument. Don't forget to separate it from the others with a * and enclose it in quote marks again:

...}*customGcsTempLocation="gs://bucket/whatever"
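
Putting all the pieces together, here's a sketch of the complete command. The job name, bucket, and table references are placeholders carried over from the examples above (only two of the 17 input locations are shown, and gs://bucketname/temp is a hypothetical temp location), and the backslash line continuations assume a bash-style shell:

gcloud dataflow jobs run job-201807301630 \
    --gcs-location=gs://bucketname/dataprep/dataprep_template \
    --parameters=^*^inputLocations="{\"location1\":\"project:bq_dataset.bq_table1\",\"location2\":\"project:bq_dataset.bq_table2\"}"*outputLocations="{\"location1\":\"gs://bucketname/output.csv\",\"location2\":\"project:bq_dataset.bq_output1\"}"*customGcsTempLocation="gs://bucketname/temp"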

Pretty much none of this is explained in the online documentation, so that's several days of my life I won't get back - hopefully I've helped someone else with this.
