AWS EMR Spark: Error writing to S3 - IllegalArgumentException - Cannot create a path from an empty string
Problem description
I have been trying to fix this for a long time now ... no idea why I get this? FYI, I'm running Spark on an AWS EMR cluster. I debugged and clearly see the destination path provided ... something like s3://my-bucket-name/. The Spark job creates ORC files and writes them after creating a partition like so: date=2017-06-10. Any ideas?
17/07/08 22:48:31 ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: Can not create a Path from an empty string
java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126)
at org.apache.hadoop.fs.Path.<init>(Path.java:134)
at org.apache.hadoop.fs.Path.<init>(Path.java:93)
at org.apache.hadoop.fs.Path.suffix(Path.java:361)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions(InsertIntoHadoopFsRelationCommand.scala:138)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:82)
Code that writes the ORC files:
dataframe.write
  .partitionBy(partition)
  .option("compression", ZLIB.toString)
  .mode(SaveMode.Overwrite)
  .orc(destination)
Recommended answer
I have seen a similar problem when writing Parquet files to S3. The problem is SaveMode.Overwrite. This mode doesn't seem to work correctly in combination with S3. Try deleting all the data in your S3 bucket my-bucket-name before writing into it. Then your code should run successfully.
To delete all files from your bucket my-bucket-name, you can use the following PySpark code:
# see https://www.quora.com/How-do-you-overwrite-the-output-directory-when-using-PySpark
# Grab the JVM classes for Hadoop's filesystem API through the Py4J gateway
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem

# see http://crazyslate.com/how-to-rename-hadoop-files-using-wildcards-while-patterns/
fs = FileSystem.get(URI("s3a://my-bucket-name"), sc._jsc.hadoopConfiguration())

# Glob every top-level entry in the bucket and delete each one recursively
file_status = fs.globStatus(Path("/*"))
for status in file_status:
    fs.delete(status.getPath(), True)  # True = recursive delete
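For what it's worth, the Path.suffix frame in the stack trace, combined with a destination like s3://my-bucket-name/, suggests one likely trigger: the output path is the bucket root, so its final path component is empty, and Hadoop's Path constructor rejects empty strings when Spark's deleteMatchingPartitions derives a related path from it. Here is an illustrative pure-Python sketch of that idea (not Hadoop's actual implementation; the last_component helper is hypothetical):

```python
from urllib.parse import urlparse

def last_component(uri: str) -> str:
    """Return the final path component of a URI, roughly what
    Hadoop's Path.getName() would give for the same path."""
    path = urlparse(uri).path
    if not path.strip("/"):
        return ""  # bucket root: no final component at all
    return path.rstrip("/").rsplit("/", 1)[-1]

# A destination pointing at the bucket root has an empty final component ...
assert last_component("s3://my-bucket-name/") == ""
# ... while a destination with a key prefix does not.
assert last_component("s3://my-bucket-name/output") == "output"
```

If that is the trigger here, writing to a key prefix (for example s3://my-bucket-name/output/) rather than the bucket root may also avoid the error.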