Getting "org.apache.spark.sql.AnalysisException: Path does not exist" from SparkSession.read()


Question

I am trying to read a file submitted via spark-submit to a YARN cluster in client mode. Putting the file in HDFS is not an option. Here's what I've done:

import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  if (args != null && args.length > 0) {
    val inputfile: String = args(0)

    // get filename: train.csv
    val input_filename = inputfile.split("/").toList.last

    val spark = SparkSession.builder().getOrCreate()
    val d = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(SparkFiles.get(input_filename))
    d.show()
  }
}
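The filename-extraction step above (split on "/" and take the last segment) can be sketched with a plain JDK call, no Spark required; baseName is a hypothetical helper name, not part of the question's code:

```scala
import java.nio.file.Paths

// Hedged sketch of the filename-extraction step above.
// baseName is a hypothetical helper; Paths.get is plain JDK, no Spark needed.
def baseName(path: String): String =
  Paths.get(path).getFileName.toString

println(baseName("repo/data/train.csv"))  // train.csv
```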

and submitted it to YARN this way:

spark2-submit \
--class "com.example.HelloWorld" \
--master yarn --deploy-mode client \
--files repo/data/train.csv \
--driver-cores 2 helloworld-assembly-0.1.jar repo/data/train.csv

but got this exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://xxxxx.xxxxx.xxxx.com:8020/tmp/spark-db3ee991-7f3d-427c-8479-aa212f906dc5/userFiles-040293ee-0d1f-44dd-ad22-ef6fe729bd49/train.csv; 

I also tried:

val input_filename_1 = """file://""" + SparkFiles.get(input_filename)
println(input_filename_1)

spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(input_filename_1) 

and still got a similar error:

 file:///tmp/spark-fbd46e9d-c450-4f86-8b23-531e239d7b98/userFiles-8d129eb3-7edc-479d-aeda-2da98432fc50/train.csv
 Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: file:/tmp/spark-fbd46e9d-c450-4f86-8b23-531e239d7b98/userFiles-8d129eb3-7edc-479d-aeda-2da98432fc50/train.csv;

Answer

Instead of passing the file with --files test.csv, I used spark.sparkContext.addFile("test.csv"):

spark.sparkContext.addFile("test.csv")
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("file://"+SparkFiles.get("test.csv"))
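With addFile() called in the application code, the --files flag can be dropped from the submit command. A hedged sketch of the matching invocation, reusing the class name and jar from the question (test.csv is assumed to sit in the directory you submit from):

```shell
# Sketch only: class name, jar, and master settings are taken from the
# question above; with sparkContext.addFile("test.csv") in the code,
# the --files flag is no longer needed.
spark2-submit \
  --class "com.example.HelloWorld" \
  --master yarn --deploy-mode client \
  --driver-cores 2 helloworld-assembly-0.1.jar test.csv
```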

The path returned by scala> SparkFiles.get("test.csv"), e.g.:

/tmp/spark-9c4ea9a6-95d7-44ff-8cfb-1d9ce9f30638/userFiles-f8909daa-9710-4416-b0f0-9d9043db5d8c/test.csv

is created on the local file system of the machine where you submit the job.

So the workers do not have this file to read. The problem might be with using spark.read.csv, which resolves the path on the executors rather than on the driver.

I tried copying the locally created file to the other nodes, and it worked.

Hope this helps.
