pyspark: couldn't find the local file

Question

I have the following simple Python code:

from __future__ import print_function

import sys
from operator import add

from pyspark import SparkContext


if __name__ == "__main__":
    print(len(sys.argv))
    # The job is submitted with an extra leading argument ("wordcount"),
    # so the input file arrives as sys.argv[2] and three entries are needed.
    if len(sys.argv) < 3:
        print("Usage: wordcount <file>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(sys.argv[2], 1)
    # Split lines into words, pair each word with 1, and sum the counts.
    counts = lines.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

    sc.stop()

Then I tried to run it on a local cluster by doing:

spark-submit --master spark://rws-lnx-sprk01:7077 /home/edamameQ/wordcount.py wordcount /home/edamameQ/wordTest.txt

The wordTest.txt file is definitely available:

edamameQ@spark-cluster:~$ ls
data    jars   myJob.txt  wordTest.txt  wordcount.py

But I keep getting the following errors:

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 :
 :
Caused by: java.io.FileNotFoundException: File file:/home/edamameQ/wordTest.txt does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)

The same code was working on AWS with the input file from an S3 location. Is there anything I need to adjust to run it on a local cluster? Thanks!

Answer

The file you want to read has to be accessible on all of the workers. If it is a local file, the only option is to keep a copy on each worker machine.
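
As a rough sketch of the two usual workarounds (the path and app name are taken from the question; everything else is an assumption rather than code from the answer): either copy the file to the identical path on every worker before submitting, or, for a small input, read it on the driver only and parallelize the lines so no worker ever opens the local path.

from pyspark import SparkContext

sc = SparkContext(appName="PythonWordCount")

# Option 1: the file has already been copied (e.g. with scp) to the same
# path on every worker, so each executor can open it directly:
#   lines = sc.textFile("file:///home/edamameQ/wordTest.txt")

# Option 2: for a small file, read it on the driver only and parallelize
# the lines; workers receive the data over the network instead of opening
# the local path themselves:
with open("/home/edamameQ/wordTest.txt") as f:
    lines = sc.parallelize(f.read().splitlines())

print(lines.count())
sc.stop()

Putting the file on a shared store (HDFS or S3, as in the AWS run) avoids the problem entirely, since every worker resolves the same URI.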
