When submitting a job with pyspark, how do I access static files uploaded with the --files argument?


Problem description

For example, I have a folder:

/
  - test.py
  - test.yml

and the job is submitted to the Spark cluster with:

gcloud beta dataproc jobs submit pyspark --files=test.yml test.py

In test.py, I want to access the static file I uploaded:

with open('test.yml') as test_file:
    logging.info(test_file.read())

but I got the following exception:

IOError: [Errno 2] No such file or directory: 'test.yml'

How do I access the file I uploaded?

Answer

Files distributed using SparkContext.addFile (and --files) can be accessed via SparkFiles. It provides two methods:


  • getDirectory() - returns the root directory for distributed files
  • get(filename) - returns the absolute path to the file

I am not sure whether there are any Dataproc-specific limitations, but something like this should work just fine:

import logging

from pyspark import SparkFiles

with open(SparkFiles.get('test.yml')) as test_file:
    logging.info(test_file.read())
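
As a side note, the same file can also be located via the distributed-files root directory instead of by name. Here is a minimal sketch (not from the original answer) using getDirectory(), assuming the same test.yml shipped with --files:

import logging
import os

from pyspark import SparkFiles

# SparkFiles.getDirectory() returns the local root directory holding every
# file shipped with --files / SparkContext.addFile.
root = SparkFiles.getDirectory()

# Joining the root with the plain file name is equivalent to SparkFiles.get('test.yml').
with open(os.path.join(root, 'test.yml')) as test_file:
    logging.info(test_file.read())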

