spark execution - a single way to access file contents in both the driver and executors
Question
According to this question (--files option in pyspark not working), the sc.addFile option should make a file accessible in both the driver and the executors, but I cannot get it to work on the executors.
test.py
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles
conf = SparkConf().setAppName("File access test")
sc = SparkContext(conf=conf)
sc.addFile("file:///home/hadoop/uploads/readme.txt")
with open(SparkFiles.get('readme.txt')) as test_file:
lines = [line.strip() for line in test_file]
print(lines) # this works
print('********************')
lines = sc.textFile(SparkFiles.get('readme.txt')) # runs on the executors; this errors
print(lines.collect())
Command
spark-submit --master yarn --deploy-mode client test.py
readme.txt
Under /home/hadoop/uploads on the master node.
I see the following in the logs:
21/01/27 15:03:30 INFO SparkContext: Added file file:///home/hadoop/uploads/readme.txt at spark://ip-10-133-70-121.sysco.net:44401/files/readme.txt with timestamp 1611759810247
21/01/27 15:03:30 INFO Utils: Copying /home/hadoop/uploads/readme.txt to /mnt/tmp/spark-f929a1e2-e7e8-401e-8e2e-dcd1def3ee7b/userFiles-fed4d5bf-3e31-4e1e-b2ae-3d4782ca265c/readme.txt
So it is copying the file to some Spark scratch directory on the local mount (I am still relatively new to the Spark world). If I use the --files flag and pass the file, it also copies it to an hdfs:// path that can be read by the executors.
Is this because addFile requires the file to also be present locally on the executors? Currently readme.txt is only on the master node. If so, is there a way to propagate it from the master to the executors?
I am trying to find one uniform way of accessing the file. I am able to push the file from my local machine to the master node. In the Spark code, however, I would like a single way of accessing the contents of a file, whether from the driver or from an executor.
Currently, for the executor part of the code to work, I also have to pass the file in the --files flag (spark-submit --master yarn --deploy-mode client --files uploads/readme.txt test.py) and use something like the following:
path = f'hdfs://{sc.getConf().get("spark.driver.host")}:8020/user/hadoop/.sparkStaging/{sc.getConf().get("spark.app.id")}/readme.txt'
lines = sc.textFile(path)
Answer
One way you can do this is by putting the files on an S3 bucket and then pointing to the file locations in your spark-submit. In that case, all the worker nodes will fetch the same file from S3.
Make sure that your EMR nodes have access to that S3 bucket.
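A sketch of that setup (the bucket name my-bucket is hypothetical): upload the file once, then reference it by its s3:// URI so the driver and every executor read the same object directly.

```
aws s3 cp /home/hadoop/uploads/readme.txt s3://my-bucket/uploads/readme.txt
spark-submit --master yarn --deploy-mode client test.py
```

Inside test.py the read then becomes uniform for driver and executors, e.g. lines = sc.textFile("s3://my-bucket/uploads/readme.txt"), since on EMR the s3:// scheme is handled by EMRFS on every node.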