Pyspark - Load file: Path does not exist


Question

I am a newbie to Spark. I'm trying to read a local csv file within an EMR cluster. The file is located in: /home/hadoop/. The script that I'm using is this one:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.csv('/home/hadoop/observations_temp.csv', header=True)

When I run the script, it raises the following error message:

pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-172-31-39-54.eu-west-1.compute.internal:8020/home/hadoop/observations_temp.csv

Then, I found out that I have to add file:// in the file path so it can read the file locally:

df = spark.read.csv('file:///home/hadoop/observations_temp.csv', header=True)

But this time, the above approach raised a different error:

Lost task 0.3 in stage 0.0 (TID 3,
ip-172-31-41-81.eu-west-1.compute.internal, executor 1): java.io.FileNotFoundException: File file:/home/hadoop/observations_temp.csv does not exist

I think this is because the file:// prefix just reads the file locally and does not distribute it to the other nodes.

Do you know how can I read the csv file and make it available to all the other nodes?

Answer

You are right: the file is missing from your worker nodes, and that is what raises the error you got.

Here is the reference from the official documentation: External Datasets.

If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.

So basically you have two solutions:

You copy your file into each worker before starting the job;

Or you upload it to HDFS with something like this (recommended solution):

hadoop fs -put localfile /user/hadoop/hadoopfile.csv

Now you can read it with:

df = spark.read.csv('/user/hadoop/hadoopfile.csv', header=True)

It seems that you are also using AWS S3. You can always try to read the file directly from S3 without downloading it (with the proper credentials, of course).
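
As a minimal sketch of that (the bucket name and key below are placeholders, and the cluster needs credentials or an IAM role with access to the bucket), an EMR cluster can usually read S3 paths directly:

# EMR resolves the s3:// scheme through EMRFS; outside EMR, s3a:// is the usual scheme
df = spark.read.csv('s3://my-bucket/some/prefix/observations_temp.csv', header=True)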

Some suggest that the --files flag of spark-submit uploads the files to the execution directories. I don't recommend this approach unless your csv file is very small, but then you won't really need Spark.
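
For completeness, a minimal sketch of that approach, assuming the job is launched with something like spark-submit --files /home/hadoop/observations_temp.csv my_job.py (the script name is just a placeholder): inside the job, the shipped copy is resolved through SparkFiles and read with plain Python, which is also why it only suits very small files:

from pyspark import SparkFiles

# SparkFiles.get returns the local path of the copy shipped via --files
local_path = SparkFiles.get('observations_temp.csv')
with open(local_path) as f:
    header = f.readline()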

Alternatively, I would stick with HDFS (or any distributed file system).
