Pyspark - Load file: Path does not exist

Problem description

I am a newbie to Spark. I'm trying to read a local csv file within an EMR cluster. The file is located in: /home/hadoop/. The script that I'm using is this one:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.csv('/home/hadoop/observations_temp.csv', header=True)

When I run the script, it raises the following error message:

pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-172-31-39-54.eu-west-1.compute.internal:8020/home/hadoop/observations_temp.csv

Then, I found out that I have to add file:// in the file path so it can read the file locally:

df = spark.read.csv('file:///home/hadoop/observations_temp.csv', header=True)

But this time, the above approach raised a different error:

Lost task 0.3 in stage 0.0 (TID 3,
ip-172-31-41-81.eu-west-1.compute.internal, executor 1): java.io.FileNotFoundException: File file:/home/hadoop/observations_temp.csv does not exist

I think this is because the file:// prefix only reads the file locally and does not distribute it across the other nodes.

Do you know how I can read the csv file and make it available to all the other nodes?

Recommended answer

You are right: the file is missing from your worker nodes, and that is what raises the error you got.

Here is the official documentation reference on External Datasets:

If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.

So basically you have two solutions:

You copy your file into each worker before starting the job;
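As a rough sketch (the worker hostnames below are hypothetical, and the file must end up at the same absolute path on every node), this could be done from the master node with something like:

for host in worker-1 worker-2 worker-3; do
  scp /home/hadoop/observations_temp.csv hadoop@"$host":/home/hadoop/
done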

Or you upload it to HDFS with something like this (recommended solution):

hadoop fs -put localfile /user/hadoop/hadoopfile.csv

Now you can use:

df = spark.read.csv('/user/hadoop/hadoopfile.csv', header=True)

It seems that you are also using AWS S3. You can always try to read it directly from S3 without downloading it. (with the proper credentials of course)
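As a minimal sketch (the bucket name and key below are hypothetical), reading straight from S3 would look like this; on EMR the s3:// connector and instance-profile credentials are normally already configured:

# Hypothetical bucket and key; EMR clusters can usually read s3:// paths out of the box
df = spark.read.csv('s3://my-bucket/observations_temp.csv', header=True)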

Some suggest that the --files flag provided with spark-submit uploads the files to the execution directories. I don't recommend this approach unless your csv file is very small, but then you won't need Spark.
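For reference, such an invocation would look roughly like this (the script name my_job.py is hypothetical); --files places the listed file in the working directory of each executor:

spark-submit --files /home/hadoop/observations_temp.csv my_job.py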

Alternatively, I would stick with HDFS (or any distributed file system).
