Spark Error reading csv file with spaces in the path/file name
Question
I want to read a CSV file using Spark. The file's path contains blank spaces, and Spark is replacing them with %20.
This is the code:
val tmpDF = spark.read.format("com.databricks.spark.csv")
  .option("multiLine", value = true)
  .option("quote", "\"")
  .option("escape", "\"")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", delimiter)
  .load(filename)

tmpDF.show(10)
When tmpDF.show(10) is executed, the following error is thrown:
java.io.FileNotFoundException: No such file or directory: s3://{bucket_name}/all/Proposal%20and%20pre-approval/filen_name_20190826-215950.csv
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running the REFRESH TABLE tableName command in SQL or by recreating the Dataset/DataFrame involved.
I checked in S3 and the file does exist, but the path has a regular space instead of %20.
Any idea how to handle this? I can't change the paths, because they are produced by a component that I can't modify.
Answer
This is a typical URL-encoding problem. The path coming from S3 is encoded with %20, but Spark decodes it incorrectly.
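To illustrate the mechanism (a standalone sketch using `java.net.URI` from the JDK, not Spark itself): percent-encoding turns each space in a path into `%20`, while the decoded view restores the literal spaces. The bucket and file names below are placeholders.

```scala
import java.net.URI

// The multi-argument URI constructor percent-encodes illegal
// characters in the path, so each space becomes %20.
val uri = new URI("s3", "bucket", "/all/Proposal and pre-approval/file.csv", null)

println(uri.toString)  // encoded form, spaces shown as %20
println(uri.getPath)   // decoded form: /all/Proposal and pre-approval/file.csv
```

A reader that keeps the encoded form and never decodes it will look for a literal `%20` in the object key, which is exactly the FileNotFoundException shown above.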
There are two Spark issues related to this. They have been resolved in Spark 2.3; if you are using an older version, you need to escape the file names after decoding the URL.
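A minimal sketch of that workaround, assuming the incoming path carries `%20` escapes (the helper name `decodePath` is hypothetical, not from the original answer):

```scala
import java.net.URLDecoder

// Hypothetical helper: decode percent-escapes (e.g. %20 -> space) so the
// path matches the real object key in S3 before handing it to Spark.
// Caveat: URLDecoder also turns '+' into a space, so this assumes the
// object keys do not contain literal '+' characters.
def decodePath(p: String): String =
  URLDecoder.decode(p, "UTF-8")

val fixed = decodePath("s3://bucket/all/Proposal%20and%20pre-approval/file.csv")
// fixed now contains regular spaces instead of %20
```

You would then pass `fixed` to `spark.read...load(...)` in place of the raw path.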