Passing typesafe config conf files to DataProcSparkOperator

Problem description

I am using Google dataproc to submit spark jobs and google cloud composer to schedule them. Unfortunately, I am facing difficulties.

I am relying on .conf files (typesafe config files) to pass arguments to my spark jobs.

I am using the following python code for the airflow dataproc:

t3 = dataproc_operator.DataProcSparkOperator(
    task_id='execute_spark_job_cluster_test',
    dataproc_spark_jars='gs://snapshots/jars/pubsub-assembly-0.1.14-SNAPSHOT.jar',
    cluster_name='cluster',
    main_class='com.organ.ingestion.Main',
    project_id='project',
    dataproc_spark_properties={'spark.driver.extraJavaOptions': 'gs://file-dev/fileConf/development.conf'},
    scopes='https://www.googleapis.com/auth/cloud-platform',
    dag=dag)

But this is not working and I am getting some errors.

Could anyone help me with this?
Basically I want to be able to override the .conf files and pass them as arguments to my DataProcSparkOperator.
I also tried to pass the file through arguments:

arguments=['gs://file-dev/fileConf/development.conf']

but this didn't take into account the .conf file mentioned in the arguments.

Recommended answer

tl;dr You need to turn your development.conf file into a dictionary to pass to dataproc_spark_properties.
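
For illustration, here is a minimal sketch of that conversion, assuming development.conf holds flat key = value pairs. A real typesafe config (HOCON) file with nesting, includes, or substitutions would need a proper parser such as pyhocon instead:

# Sketch: read a flat "key = value" .conf file into a dict suitable for
# dataproc_spark_properties. Real HOCON needs a full parser (e.g. pyhocon).
def conf_to_dict(path):
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and anything that isn't key = value.
            if not line or line.startswith('#') or '=' not in line:
                continue
            key, value = line.split('=', 1)
            props[key.strip()] = value.strip().strip('"')
    return props

spark_properties = conf_to_dict('development.conf')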

Full explanation:

There are two main ways to set properties -- on the cluster level and on the job level.

1) Job level

Looks like you are trying to set them on the job level: DataProcSparkOperator(dataproc_spark_properties={'foo': 'bar', 'foo2': 'bar2'}). That's the same as gcloud dataproc jobs submit spark --properties foo=bar,foo2=bar2 or spark-submit --conf foo=bar --conf foo2=bar2. Here is the documentation for per-job properties.
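
As a concrete sketch, the task from the question could set job-level properties like this (the property names and values below are illustrative placeholders, not taken from the original .conf file):

t3 = dataproc_operator.DataProcSparkOperator(
    task_id='execute_spark_job_cluster_test',
    dataproc_spark_jars='gs://snapshots/jars/pubsub-assembly-0.1.14-SNAPSHOT.jar',
    cluster_name='cluster',
    main_class='com.organ.ingestion.Main',
    project_id='project',
    # Job-level Spark properties, equivalent to spark-submit --conf flags.
    dataproc_spark_properties={'spark.executor.memory': '4g',
                               'spark.dynamicAllocation.enabled': 'true'},
    dag=dag)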

The argument to spark.driver.extraJavaOptions should be command line arguments you would pass to java. For example, -verbose:gc.
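
So if the goal is to point typesafe config at a file, the usual flag is -Dconfig.file. Note (an assumption based on how typesafe config resolves files) that it must be a local path on the driver, not a gs:// URI, so the file has to be staged onto the machine first:

# Sketch: a JVM flag, not a gs:// path. /path/on/driver/development.conf
# is a placeholder for wherever the file was staged locally.
dataproc_spark_properties={
    'spark.driver.extraJavaOptions': '-Dconfig.file=/path/on/driver/development.conf',
}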

2) Cluster level

You can also set properties on a cluster level using DataprocClusterCreateOperator(properties={'spark:foo': 'bar', 'spark:foo2': 'bar2'}), which is the same as gcloud dataproc clusters create --properties spark:foo=bar,spark:foo2=bar2 (documentation). Again, you need to use a dictionary.

Importantly, if you specify properties at the cluster level, you need to prefix them with which config file you want to add the property to. If you use spark:foo=bar, that means add foo=bar to /etc/spark/conf/spark-defaults.conf. There are similar prefixes for yarn-site.xml, etc.
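
A sketch of the cluster-level equivalent in Airflow (num_workers, zone, and the property entries are placeholders):

create_cluster = dataproc_operator.DataprocClusterCreateOperator(
    task_id='create_cluster',
    cluster_name='cluster',
    project_id='project',
    num_workers=2,
    zone='us-central1-a',
    # The 'spark:' prefix routes each entry into
    # /etc/spark/conf/spark-defaults.conf on the cluster.
    properties={'spark:spark.executor.memory': '4g',
                'spark:spark.dynamicAllocation.enabled': 'true'},
    dag=dag)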

3) Using your .conf file at the cluster level

If you don't want to turn your .conf file into a dictionary, you can also just append it to /etc/spark/conf/spark-defaults.conf using an initialization action (https://github.com/GoogleCloudPlatform/dataproc-initialization-actions) when you create the cluster.

For example (untested):

#!/bin/bash
set -euxo pipefail

gsutil cp gs://path/to/my.conf .
cat my.conf >> /etc/spark/conf/spark-defaults.conf

Note that you want to append to rather than replace the existing config file, just so that you only override the configs you need to.
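
To wire this into the same DAG, the script can be uploaded to GCS and referenced through the operator's init_actions_uris parameter when the cluster is created (the bucket path below is a placeholder):

create_cluster = dataproc_operator.DataprocClusterCreateOperator(
    task_id='create_cluster_with_conf',
    cluster_name='cluster',
    project_id='project',
    num_workers=2,
    zone='us-central1-a',
    # Runs the script above on each node while the cluster is being created.
    init_actions_uris=['gs://my-bucket/append-spark-defaults.sh'],
    dag=dag)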
