spark execution - a single way to access file contents in both the driver and executors


Question


According to this question - --files option in pyspark not working - the sc.addFile option should work for accessing files in both the driver and the executors, but I cannot get it to work on the executors.

test.py

from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

conf = SparkConf().setAppName("File access test")
sc = SparkContext(conf=conf)
sc.addFile("file:///home/hadoop/uploads/readme.txt")

with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines) # this works
print('********************')
lines = sc.textFile(SparkFiles.get('readme.txt')) # run in the executors. this errors
print(lines.collect())

Command

spark-submit --master yarn --deploy-mode client test.py

readme.txt is in /home/hadoop/uploads on the master node.


I see the following in the logs:

21/01/27 15:03:30 INFO SparkContext: Added file file:///home/hadoop/uploads/readme.txt at spark://ip-10-133-70-121.sysco.net:44401/files/readme.txt with timestamp 1611759810247
21/01/27 15:03:30 INFO Utils: Copying /home/hadoop/uploads/readme.txt to /mnt/tmp/spark-f929a1e2-e7e8-401e-8e2e-dcd1def3ee7b/userFiles-fed4d5bf-3e31-4e1e-b2ae-3d4782ca265c/readme.txt


So it's copying it to some Spark directory and mount point (I am still relatively new to the Spark world). If I use the --files flag and pass the file, it also copies it to an hdfs:// path that the executors can read.


Is this because addFile requires the file to also be present locally on the executors? Currently readme.txt is only on the master node. If so, is there a way to propagate it from the master to the executors?
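For what it's worth, one pattern that should work is to resolve SparkFiles.get inside the task itself rather than on the driver, since addFile ships the file to every executor; the failing line above resolves a driver-local path and then asks the executors to read it. A minimal sketch (the small RDD driving the tasks is only illustrative):

from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

conf = SparkConf().setAppName("File access test")
sc = SparkContext(conf=conf)
sc.addFile("file:///home/hadoop/uploads/readme.txt")

def read_readme(_partition):
    # Runs on the executor, so SparkFiles.get resolves to the
    # executor-local copy that addFile distributed.
    with open(SparkFiles.get('readme.txt')) as f:
        return [line.strip() for line in f]

# Dummy RDD just to run the function on the executors; each partition re-reads the file.
print(sc.parallelize(range(2), 2).mapPartitions(read_readme).collect())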


I am trying to find one uniform way of accessing the file. I am able to push the file from my local machine to the master node. In the Spark code, however, I would like a single way of accessing the contents of the file, whether it runs on the driver or on an executor.


Currently, for the executor part of the code to work, I also have to pass the file via the --files flag (spark-submit --master yarn --deploy-mode client --files uploads/readme.txt test.py) and use something like the following:

path = f'hdfs://{sc.getConf().get("spark.driver.host")}:8020/user/hadoop/.sparkStaging/{sc.getConf().get("spark.app.id")}/readme.txt'
lines = sc.textFile(path)

Answer


One way you can do this is by putting the files on an S3 bucket and then pointing to the file locations in your spark-submit. In that case, all the worker nodes will get the same file from S3.


Make sure that your EMR nodes have access to that s3 bucket.
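A minimal sketch of that approach, assuming the file has already been uploaded to a bucket (the bucket name and key below are placeholders):

# readme.txt uploaded to S3 beforehand; bucket and key are placeholders
s3_path = "s3://my-bucket/uploads/readme.txt"

# The same path is resolvable from the driver and from every executor,
# so textFile can read it without any extra staging.
lines = sc.textFile(s3_path)
print(lines.collect())

On EMR the s3:// scheme is available through EMRFS, so no extra Hadoop configuration should be needed beyond the bucket permissions mentioned above.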

