pyspark: couldn't find the local file
Problem description
I have the following simple Python code:
from __future__ import print_function

import sys
from operator import add

from pyspark import SparkContext

if __name__ == "__main__":
    print(len(sys.argv))
    if len(sys.argv) < 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(sys.argv[2], 1)
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))
    sc.stop()
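For reference, the transformation chain above (flatMap to split lines into words, map to pair each word with 1, reduceByKey to sum per word) is equivalent to this plain-Python word count. This is only an illustrative sketch of the logic; it does not use Spark:

```python
# Plain-Python equivalent of the Spark pipeline:
# flatMap  -> split each line into words
# map      -> pair each word with 1
# reduceByKey(add) -> sum the 1s per word
from collections import Counter

def word_count(lines):
    """Count words across an iterable of text lines, split on single spaces."""
    counts = Counter()
    for line in lines:
        counts.update(line.split(' '))
    return dict(counts)

print(word_count(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```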
Then I tried to run it on a local cluster by doing:
spark-submit --master spark://rws-lnx-sprk01:7077 /home/edamameQ/wordcount.py wordcount /home/edamameQ/wordTest.txt
The file wordTest.txt is definitely available:
edamameQ@spark-cluster:~$ ls
data jars myJob.txt wordTest.txt wordcount.py
But I keep getting the errors:
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
:
:
Caused by: java.io.FileNotFoundException: File file:/home/edamameQ/wordTest.txt does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
The same code was working on AWS with the input file read from an S3 location. Is there anything I need to adjust to run on a local cluster? Thanks!
Recommended answer
The file you want to read has to be accessible on all of the workers. If it is a local file, the only option is to keep a copy at the same path on every worker machine.
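Two common ways to do that are sketched below; the worker hostnames and HDFS path here are hypothetical, and the second option assumes the cluster has HDFS (or another shared store such as S3) available:

```shell
# Option 1: copy the file to the identical path on every worker
# (worker hostnames below are made up for illustration)
for host in worker1 worker2; do
    scp /home/edamameQ/wordTest.txt "$host":/home/edamameQ/wordTest.txt
done

# Option 2: put the file in shared storage all workers can reach,
# then pass that URI to the job instead of a file:// path
hdfs dfs -put /home/edamameQ/wordTest.txt /data/wordTest.txt
spark-submit --master spark://rws-lnx-sprk01:7077 \
    /home/edamameQ/wordcount.py wordcount hdfs:///data/wordTest.txt
```

With option 2, `sc.textFile` receives an `hdfs://` URI, so every executor reads the same shared copy rather than looking for a local file on its own filesystem.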