spark execution - a single way to access file contents in both the driver and executors
Question
According to this question (--files option in pyspark not working), the sc.addFile option should make a file accessible in both the driver and the executors, but I cannot get it to work on the executors.
test.py
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles
conf = SparkConf().setAppName("File access test")
sc = SparkContext(conf=conf)
sc.addFile("file:///home/hadoop/uploads/readme.txt")
with open(SparkFiles.get('readme.txt')) as test_file:
lines = [line.strip() for line in test_file]
print(lines) # this works
print('********************')
lines = sc.textFile(SparkFiles.get('readme.txt')) # runs on the executors; this errors
print(lines.collect())
Command
spark-submit --master yarn --deploy-mode client test.py
readme.txt
Under /home/hadoop/uploads on the master node.
I see the following in the logs:
21/01/27 15:03:30 INFO SparkContext: Added file file:///home/hadoop/uploads/readme.txt at spark://ip-10-133-70-121.sysco.net:44401/files/readme.txt with timestamp 1611759810247
21/01/27 15:03:30 INFO Utils: Copying /home/hadoop/uploads/readme.txt to /mnt/tmp/spark-f929a1e2-e7e8-401e-8e2e-dcd1def3ee7b/userFiles-fed4d5bf-3e31-4e1e-b2ae-3d4782ca265c/readme.txt
So it is copying the file to some Spark scratch directory on the local mount (I am still relatively new to the Spark world). If I use the --files flag and pass the file, it also copies it to an hdfs:// path that can be read by the executors.
Is this because addFile requires the file to also be present locally on the executors? Currently readme.txt is only on the master node. If so, is there a way to propagate it from the master to the executors?
I am trying to find one uniform way of accessing the file. I am able to push the file from my local machine to the master node. In the Spark code, however, I would like a single way of accessing the contents of a file, whether from the driver or from an executor.
Currently, for the executor part of the code to work, I also have to pass the file in the --files flag (spark-submit --master yarn --deploy-mode client --files uploads/readme.txt test.py) and use something like the following:
path = f'hdfs://{sc.getConf().get("spark.driver.host")}:8020/user/hadoop/.sparkStaging/{sc.getConf().get("spark.app.id")}/readme.txt'
lines = sc.textFile(path)
Answer
One way you can do this is by putting the files on an S3 bucket and then pointing to the file locations in your spark-submit. In that case, all the worker nodes will fetch the same file from S3.
Make sure that your EMR nodes have access to that S3 bucket.
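A sketch of that setup (the bucket name my-bucket is hypothetical): upload the file once, then reference it by its s3:// URI so the driver and every executor read the same object directly.

```
aws s3 cp /home/hadoop/uploads/readme.txt s3://my-bucket/uploads/readme.txt
spark-submit --master yarn --deploy-mode client test.py
```

Inside test.py the read then becomes uniform for driver and executors, e.g. lines = sc.textFile("s3://my-bucket/uploads/readme.txt"), since on EMR the s3:// scheme is handled by EMRFS on every node.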