Why are "sc.addFile" and "spark-submit --files" not distributing a local file to all workers?


Problem description

I have a CSV file "test.csv" that I'm trying to have copied to all nodes on the cluster.

I have a 4-node Apache Spark 1.5.2 standalone cluster. There are 4 workers, and one node also acts as the master/driver in addition to being a worker.

If I run:

$SPARK_HOME/bin/pyspark --files=./test.csv

or, from within the REPL interface, execute:

sc.addFile('file://' + '/local/path/to/test.csv')

I see Spark log the following:

16/05/05 15:26:08 INFO Utils: Copying /local/path/to/test.csv to /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv
16/05/05 15:26:08 INFO SparkContext: Added file file:/local/path/to/test.csv at http://192.168.1.4:39578/files/test.csv with timestamp 1462461968158

In a separate window on the master/driver node, I can easily locate the file using ls, i.e. (ls -al /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv).

However, if I log into the workers, there is no file at /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv, and not even a folder at /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b.

But the Apache Spark web interface shows a job running and cores allocated on all nodes, and no other warnings or errors appear in the console.
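(For what it's worth, one way to check where the files actually land is to ask each executor from inside a running task. The sketch below is an assumption-laden diagnostic, not from the original question; it assumes the same pyspark REPL session, so sc already exists:)

from pyspark import SparkFiles

# Each executor reports its own SparkFiles root; the userFiles-* directory
# is created on a worker only after a task has actually run there.
roots = (sc.parallelize(range(8), 8)
           .map(lambda _: SparkFiles.getRootDirectory())
           .distinct()
           .collect())
print(roots)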

Answer

As Daniel commented, each worker manages its files differently. If you want to access an added file, use SparkFiles.get(file). If you want to see which directory your files are going to, print the output of SparkFiles.getDirectory (now SparkFiles.getRootDirectory()).
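A minimal sketch of that approach, assuming the pyspark REPL from the question (sc already exists) and the same test.csv; the path and partition counts are illustrative:

from pyspark import SparkFiles

sc.addFile('file://' + '/local/path/to/test.csv')

# On the driver, this prints the driver-side staging directory
# (the /tmp/spark-.../userFiles-... path seen in the log).
print(SparkFiles.getRootDirectory())

def read_header(_):
    # Inside a task, SparkFiles.get resolves the executor-local copy of the
    # file; executors fetch it lazily, when a task first runs on them.
    with open(SparkFiles.get('test.csv')) as f:
        return [f.readline().strip()]

# Running a job forces the fetch on the participating workers.
print(sc.parallelize(range(4), 4).mapPartitions(read_header).collect())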
