Executing a Dataflow job with multiple inputs/outputs using gcloud CLI
Problem description
I've designed a data transformation in Dataprep and am now attempting to run it by using the template in Dataflow. My flow has several inputs and outputs - the dataflow template provides them as a json object with key/value pairs for each input & location. They look like this (line breaks added for easy reading):
{
"location1": "project:bq_dataset.bq_table1",
#...
"location10": "project:bq_dataset.bq_table10",
"location17": "project:bq_dataset.bq_table17"
}
I have 17 inputs (mostly lookups) and 2 outputs (one CSV, one BigQuery). I'm passing these to the gcloud CLI like this:
gcloud dataflow jobs run job-201807301630 \
--gcs-location=gs://bucketname/dataprep/dataprep_template \
--parameters inputLocations={"location1":"project..."},outputLocations={"location1":"gs://bucketname/output.csv"}
But I'm getting an error:
ERROR: (gcloud.dataflow.jobs.run) unrecognized arguments:
inputLocations=location1:project:bq_dataset.bq_table1,outputLocations=location2:project:bq_dataset.bq_output1
inputLocations=location10:project:bq_dataset.bq_table10,outputLocations=location1:gs://bucketname/output.csv
From the error message, it looks to be merging the inputs and outputs so that as I have two outputs, each two inputs are paired with the two outputs:
input1:output1
input2:output2
input3:output1
input4:output2
input5:output1
input6:output2
...
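The error output is consistent with bash brace expansion: unquoted {a,b} groups are expanded by the shell before gcloud ever sees them, and the quotes and braces are stripped along the way. A minimal sketch, assuming bash is installed and using made-up keys (in, out) and values instead of the real locations:

```shell
# Assumption: the original command was typed into bash.
# The keys "in"/"out" and their values are placeholders.
expanded=$(bash -c 'printf "%s\n" in={"a":"1","b":"2"},out={"x":"9","y":"8"}')

# One line per argument that the program would actually receive.
printf '%s\n' "$expanded"
```

With two groups on the same word, bash emits every combination of an item from the first group with an item from the second, which would explain why each reported argument contains one input and one output.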
I've tried quoting the input/output objects (single and double, plus removing the quotes in the object), wrapping them in [], and using tildes, but no joy. Has anyone managed to execute a Dataflow job with multiple inputs?
Recommended answer
I finally found a solution for this via a huge process of trial and error. There are several steps involved.
The --parameters argument is a dictionary-type argument. The details are in the documentation you can read by typing gcloud topic escaping in the CLI, but in short you need an = between --parameters and the arguments, and then the format is key=value pairs with each value enclosed in double quotes ("):
--parameters=inputLocations="object",outputLocations="object"
Escape the objects
Then the quotes inside the objects need escaping so they don't end the value prematurely, so
{"location1":"gs://bucket/whatever"...
becomes
{\"location1\":\"gs://bucket/whatever\"...
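As a quick check of what the escaping does, the sketch below (assuming a POSIX-style shell; the variable names and the bucket path are placeholders) compares the backslash-escaped form with the same object written in single quotes. The backslashes are consumed by the shell, so both forms should hand the program the same text:

```shell
# Hypothetical object; the bucket path is a placeholder.
escaped="{\"location1\":\"gs://bucket/whatever\"}"
single='{"location1":"gs://bucket/whatever"}'

# The shell removes the backslashes, so both variables hold the same text.
printf '%s\n' "$escaped"
printf '%s\n' "$single"
```

In a script, the single-quoted form may be easier to read and maintain than escaping every quote by hand.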
Choose a different separator
Next, the CLI gets confused because, while the key=value pairs are separated by commas, the values (the objects) also contain commas. So you can define a different separator by putting it between carets (^) at the start of the argument, and then using it between the key=value pairs:
--parameters=^*^inputLocations="{\"location1\":\"...\"}"*outputLocations="{\"location1\":\"...\"}"
I used * because ; didn't work, maybe because the shell treats it as the end of the command. Who knows.
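To see why the separator matters, here is a rough illustration. It only mimics the splitting step with tr and is not gcloud's actual parser; the object contents are shortened placeholders:

```shell
# Illustration only: this mimics the splitting step, not gcloud's actual parser.
# The keys come from the answer; the values are shortened placeholders.
value='inputLocations={"location1":"a","location2":"b"}*outputLocations={"location1":"c","location2":"d"}'

# Splitting on "," cuts each JSON object apart.
pieces_by_comma=$(printf '%s\n' "$value" | tr ',' '\n' | grep -c .)

# Splitting on "*" keeps each key=value pair whole.
pieces_by_star=$(printf '%s\n' "$value" | tr '*' '\n' | grep -c .)

printf 'comma: %s pieces, star: %s pieces\n' "$pieces_by_comma" "$pieces_by_star"
```

Any character that never appears inside the values should work as the separator, as long as the shell does not treat it specially.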
Also note that the gcloud topic escaping information says:
In cmd.exe and PowerShell on Windows, ^ is a special character and you must escape it by repeating it. In the following examples, every time you see ^, replace it with ^^^^.
Don't forget customGcsTempLocation
After all that, I'd forgotten that customGcsTempLocation needs adding to the key=value pairs in the --parameters argument. Don't forget to separate it from the others with a * and enclose it in quote marks again:
...}*customGcsTempLocation="gs://bucket/whatever"
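Putting the steps together, a sketch of how the whole value might be assembled in a script, assuming a POSIX-style shell. The variable names, the second locations and the temp path are placeholders, and the gcloud call is left as a comment so the snippet only prints what would be passed:

```shell
# All names, buckets and table IDs below are placeholders.
input_json='{"location1":"project:bq_dataset.bq_table1","location2":"project:bq_dataset.bq_table2"}'
output_json='{"location1":"gs://bucketname/output.csv","location2":"project:bq_dataset.bq_output1"}'
temp_location='gs://bucketname/tmp'

# ^*^ switches the pair delimiter from "," to "*".
params="^*^inputLocations=${input_json}*outputLocations=${output_json}*customGcsTempLocation=${temp_location}"

# Show the flag exactly as gcloud would receive it.
printf '%s\n' "--parameters=${params}"

# Real invocation (not run here):
# gcloud dataflow jobs run job-201807301630 \
#   --gcs-location=gs://bucketname/dataprep/dataprep_template \
#   --parameters="${params}"
```

Keeping the objects in single-quoted variables avoids the backslash escaping, and wrapping the final value in double quotes stops the shell from touching the braces, commas and * before gcloud sees them.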
Pretty much none of this is explained in the online documentation, so that's several days of my life I won't get back - hopefully I've helped someone else with this.