Spark Error reading csv file with spaces in the path/file name

Problem description

I want to read a csv file using spark. The file's path has blank spaces. Spark is replacing the blank spaces with %20.

Here is the code:

val tmpDF = spark.read
  .format("com.databricks.spark.csv")
  .option("multiLine", value = true)
  .option("quote", "\"")
  .option("escape", "\"")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", delimiter)
  .load(filename)

tmpDF.show(10)

So when the tmpDF.show(10) method is executed the following error is thrown:

java.io.FileNotFoundException: No such file or directory: s3://{bucket_name}/all/Proposal%20and%20pre-approval/filen_name_20190826-215950.csv 

It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running REFRESH TABLE tableName command in SQL or by recreating the Dataset/DataFrame involved."

I checked in s3 and the file does exist but the path has a regular space instead of %20.

Any idea how to handle this? I can't change the paths because they are produced by a component that I can't modify.

Recommended answer

This is a typical URL-encoding problem. The URL coming from S3 is encoded, so the space appears as %20, and Spark then decodes it incorrectly.

There are two issues related to this:

  1. https://jira.apache.org/jira/browse/SPARK-23148
  2. https://jira.apache.org/jira/browse/SPARK-24320

The issues have been resolved in Spark 2.3. If you are using an older version, you need to escape the file names after decoding the URL.
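The answer does not show what the decoding step looks like. One possible reading is sketched below: turn a percent-encoded path string back into its literal form before handing it to the reader. The object name PathDecoding, the method decodePath and the bucket name my-bucket are hypothetical and used only for illustration. Whether this alone is enough depends on the Spark version and the S3 connector in use.

```scala
import java.net.URLDecoder
import java.nio.charset.StandardCharsets

// Hypothetical helper, not part of Spark: decodes percent-encoded
// sequences such as %20 in a path string.
object PathDecoding {
  def decodePath(encoded: String): String = {
    // URLDecoder treats '+' as an encoded space (HTML form rules),
    // so protect literal '+' characters before decoding.
    val plusSafe = encoded.replace("+", "%2B")
    URLDecoder.decode(plusSafe, StandardCharsets.UTF_8.name())
  }
}

object PathDecodingExample extends App {
  // Hypothetical bucket and file name, for illustration only.
  val encoded = "s3://my-bucket/all/Proposal%20and%20pre-approval/file.csv"
  println(PathDecoding.decodePath(encoded))
}
```

The decoded string could then be passed to load(...) in place of the encoded one. The '+' handling is a design choice: java.net.URLDecoder follows form-encoding rules, which would otherwise turn a literal plus sign in a file name into a space.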
