While submitting a job with pyspark, how to access static files uploaded with the --files argument?
Question
For example, I have a folder:
/
- test.py
- test.yml
and the job is submitted to the Spark cluster with:

gcloud beta dataproc jobs submit pyspark --files=test.yml test.py
In test.py, I want to access the static file I uploaded:
import logging

with open('test.yml') as test_file:
    logging.info(test_file.read())
but got the following exception:
IOError: [Errno 2] No such file or directory: 'test.yml'
How can I access the file I uploaded?
Answer
Files distributed using SparkContext.addFile (and --files) can be accessed via SparkFiles. It provides two methods:
- getRootDirectory() - returns the root directory for distributed files
- get(filename) - returns the absolute path to the file
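In PySpark both are classmethods on pyspark.SparkFiles, and get() simply resolves a basename against the staging root. A minimal sketch of how the two relate, assuming test.yml has already been distributed to the running application:

import os

from pyspark import SparkFiles

# Assumes 'test.yml' was distributed via --files or SparkContext.addFile.
root = SparkFiles.getRootDirectory()  # staging directory on this node
path = SparkFiles.get('test.yml')     # absolute path inside that directory

# get(name) is effectively os.path.join(root, name), made absolute.
print(root)
print(path)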
I am not sure if there are any Dataproc-specific limitations but something like this should work just fine:
import logging

from pyspark import SparkFiles

with open(SparkFiles.get('test.yml')) as test_file:
    logging.info(test_file.read())
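For completeness, here is a self-contained local sketch (not from the original answer) that exercises the same API with SparkContext.addFile instead of --files; the file name demo.yml is a hypothetical placeholder:

import logging
import os

from pyspark import SparkContext, SparkFiles

logging.basicConfig(level=logging.INFO)

# Hypothetical file, created here only so the example is runnable.
with open('demo.yml', 'w') as f:
    f.write('key: value\n')

sc = SparkContext('local[*]', 'sparkfiles-demo')

# addFile() stages the file the same way --files does at submit time.
sc.addFile(os.path.abspath('demo.yml'))

# Resolve the basename to the staged copy and read it back.
with open(SparkFiles.get('demo.yml')) as test_file:
    logging.info(test_file.read())

sc.stop()

The key point is that the file is opened by its basename through SparkFiles.get(), never by the path it had on the submitting machine.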