What is the best PySpark practice to load config from external file


Problem description

I would like to initialize the config once, and then use it in many modules of my PySpark project.

I see two ways to do it.


  1. Load it in the entry point and pass it as an argument to every function

main.py:

import sys
import json

# load the config once in the entry point and pass it to every step
with open(sys.argv[1]) as f:
    config = json.load(f)

df = load_df(config)
df = parse(df, config)
df = validate(df, config, strict=True)
dump(df, config)
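
For concreteness, the functions in this variant might look roughly like the sketch below; the "input_path" and "output_path" keys and the function bodies are hypothetical, not from the original post:

from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

def load_df(config: dict) -> DataFrame:
    # Read the input data from the path named in the config.
    return spark.read.json(config["input_path"])

def dump(df: DataFrame, config: dict) -> None:
    # Write the result to the path named in the config.
    df.write.mode("overwrite").parquet(config["output_path"])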

But it seems inelegant to pass this one external argument to every function.


  2. Load the config in config.py and import that object in every module

config.py:

import sys
import json

# read the config path from the first command-line argument
with open(sys.argv[1]) as f:
    config = json.load(f)

main.py:

from config import config
df = load_df()
df = parse(df)
df = validate(df, strict=True)
dump(df)

and add this line in each module:

from config import config

This seems more elegant, because strictly speaking config is not an argument of the functions; it is the general context in which they execute.

Unfortunately, PySpark pickles config.py and tries to execute it on the executors, but doesn't pass sys.argv to them! So I get this error when I run it:

  File "/PycharmProjects/spark_test/config.py", line 6, in <module>
    CONFIG_PATH = sys.argv[1]
IndexError: list index out of range

What is the best practice for working with a general config, loaded from a file, in PySpark?

Recommended answer

Your program starts execution on the master and passes the main bulk of its work to the executors by invoking functions on them. The executors are separate processes that typically run on different physical machines.

Thus, anything that the master wants to reference on the executors needs to be either a standard library function (to which the executors already have access) or a picklable object that can be sent over.

You typically don't want to load and parse any external resources on the executors, since you would always have to copy them over and make sure you load them properly... Passing a picklable object as an argument of a function (e.g. for a UDF) works much better, since there is only one place in your code where you need to load it.
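
As an illustration of that advice, here is a minimal sketch (not from the original answer) in which a plain config dict is loaded once on the driver and captured in a UDF closure, so Spark pickles it and ships it to the executors automatically; the file name, column name, and "allowed_values" key are hypothetical:

import json

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

# Load and parse the config exactly once, on the driver.
with open("config.json") as f:
    config = json.load(f)

@udf(returnType=BooleanType())
def is_valid(value):
    # config is a plain dict, so it is picklable and travels with the UDF.
    return value in config["allowed_values"]

df = spark.createDataFrame([("a",), ("x",)], ["value"])
df.withColumn("valid", is_valid("value")).show()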

I would suggest creating a config.py file and adding it as an argument to your spark-submit command:

spark-submit --py-files /path/to/config.py main_program.py

Then you can create the Spark context like this:

from pyspark import SparkContext
spark_context = SparkContext(pyFiles=['/path/to/config.py'])

and simply use import config wherever you need it.
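
For example, a config.py suited to this approach would contain only plain constants (the values below are placeholders, not from the original answer), rather than reading sys.argv, so importing it on an executor cannot fail:

# config.py -- shipped to the executors via --py-files
INPUT_PATH = "hdfs:///data/input"        # hypothetical path
OUTPUT_PATH = "hdfs:///data/output"      # hypothetical path
STRICT_VALIDATION = True

Any module that runs on the driver or on an executor can then do import config and read config.INPUT_PATH and the other constants.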

You can even include whole Python packages packaged as a single zip file instead of just a single config.py file, but then be sure to include an __init__.py in every folder that needs to be importable as a Python module.

