Why are "sc.addFile" and "spark-submit --files" not distributing a local file to all workers?
Question
I have a CSV file "test.csv" that I'm trying to copy to all nodes in the cluster.
I have a 4-node apache-spark 1.5.2 standalone cluster. There are 4 workers, and one node also acts as the master/driver in addition to being a worker.
If I run:

$SPARK_HOME/bin/pyspark --files=./test.csv

or execute sc.addFile('file://' + '/local/path/to/test.csv') from within the REPL interface, I see Spark log the following:
16/05/05 15:26:08 INFO Utils: Copying /local/path/to/test.csv to /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv
16/05/05 15:26:08 INFO SparkContext: Added file file:/local/path/to/test.csv at http://192.168.1.4:39578/files/test.csv with timestamp 1462461968158
In a separate window on the master/driver node, I can easily locate the file with ls, i.e.:

ls -al /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv
However, if I log into the workers, there is no file at /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv, and not even a folder at /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b.
But the Apache Spark web interface shows a job running and cores allocated on all nodes, and no warnings or errors appear in the console.
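A quick way to confirm this from the REPL, instead of ssh-ing into every worker, is to ask the executors themselves whether the driver's path exists. This is only a sketch: it assumes a running SparkContext sc, reuses the path from the log above, and the partition count of 16 is an arbitrary choice to spread tasks across the nodes.

# Hypothetical check: does the exact path the driver logged exist on the executors?
driver_path = ('/tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/'
               'userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv')

def exists_here(_):
    # Runs on an executor; report the host name and whether the driver's path is present there.
    import os
    import socket
    return [(socket.gethostname(), os.path.exists(driver_path))]

print(sc.parallelize(range(16), 16).mapPartitions(exists_here).collect())

If the behaviour described above holds, the entries coming back from the remote workers report False, matching what ls shows when logging in by hand.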
Answer
As Daniel commented, each worker manages files differently. If you want to access the added file, use SparkFiles.get(file). If you want to see which directory your files are going to, you can print the output of SparkFiles.getDirectory (now SparkFiles.getRootDirectory).
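A minimal PySpark sketch of that suggestion follows; the file name test.csv matches the question, but the partition count and the toy "read the first line" task are illustrative only. Each executor resolves its own local copy through SparkFiles.get, which in general is a different directory from the driver's /tmp/spark-.../userFiles-... path.

from pyspark import SparkFiles

sc.addFile('file:///local/path/to/test.csv')
print(SparkFiles.getRootDirectory())   # where added files land for this (driver) process

def read_first_line(_):
    # On the executor, SparkFiles.get resolves the local copy of the file, fetched on first use.
    with open(SparkFiles.get('test.csv')) as f:
        return [f.readline().strip()]

print(sc.parallelize(range(4), 4).mapPartitions(read_first_line).collect())

The key point is to look the file up by name with SparkFiles.get on whichever side needs it, rather than reusing the absolute path the driver logged.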