AWS EMR Spark: Error writing to S3 - IllegalArgumentException - Cannot create a path from an empty string


Problem description

I have been trying to fix this for a long time now ... no idea why I get this. FYI, I'm running Spark on an AWS EMR cluster. I debugged and can clearly see the destination path being provided ... something like s3://my-bucket-name/. The Spark job creates ORC files and writes them out after creating a partition like so: date=2017-06-10. Any ideas?

17/07/08 22:48:31 ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: Can not create a Path from an empty string
java.lang.IllegalArgumentException: Can not create a Path from an empty string
    at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126)
    at org.apache.hadoop.fs.Path.<init>(Path.java:134)
    at org.apache.hadoop.fs.Path.<init>(Path.java:93)
    at org.apache.hadoop.fs.Path.suffix(Path.java:361)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions(InsertIntoHadoopFsRelationCommand.scala:138)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:82)

The code that writes the ORC files:

dataframe.write
   .partitionBy(partition)               // the partition column, e.g. "date"
   .option("compression", ZLIB.toString) // ZLIB compression for the ORC files
   .mode(SaveMode.Overwrite)
   .orc(destination)                     // e.g. "s3://my-bucket-name/"
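
(For what it's worth, the stack trace points at org.apache.hadoop.fs.Path.suffix inside Spark's deleteMatchingPartitions. Hadoop's Path.suffix builds new Path(getParent(), getName() + suffix), and the name of a bucket-root path such as s3://my-bucket-name/ is the empty string, so the constructor can end up being handed an empty path. Below is a minimal sketch of that failure mode, assuming a pyspark shell with a live SparkContext named sc; it is an illustration, not the original poster's code.)

# Hypothetical repro of the empty-string Path failure (assumes a pyspark
# shell with a live SparkContext named sc; the bucket name is a placeholder).
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path

root = Path("s3://my-bucket-name/")   # a bucket-root path
root.getName()                        # returns the empty string ""
# Path.suffix builds new Path(getParent(), getName() + suffix); with an empty
# name and an empty suffix the child path is "", so this raises (wrapped in a
# py4j error) java.lang.IllegalArgumentException: Can not create a Path from
# an empty string
root.suffix("")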

Recommended answer

I have seen a similar problem when writing Parquet files to S3. The problem is SaveMode.Overwrite: this mode doesn't seem to work correctly in combination with S3. Try deleting all the data in your S3 bucket my-bucket-name before writing into it; then your code should run successfully.

To delete all files from your bucket my-bucket-name, you can use the following pyspark code:

# see https://www.quora.com/How-do-you-overwrite-the-output-directory-when-using-PySpark
# Grab the JVM-side Hadoop classes through the py4j gateway.
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem

# see http://crazyslate.com/how-to-rename-hadoop-files-using-wildcards-while-patterns/
# Get a FileSystem handle for the bucket and delete every top-level entry.
fs = FileSystem.get(URI("s3a://my-bucket-name"), sc._jsc.hadoopConfiguration())
file_status = fs.globStatus(Path("/*"))
for status in (file_status or []):       # globStatus returns None when nothing matches
    fs.delete(status.getPath(), True)    # True = delete recursively
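
As an aside, the same cleanup can be done without going through the JVM gateway, for example with boto3. This is a sketch under the assumption that the cluster's credentials grant s3:ListBucket and s3:DeleteObject on the bucket (my-bucket-name is the placeholder from above):

import boto3

# Empty the bucket before the SaveMode.Overwrite write.
bucket = boto3.resource("s3").Bucket("my-bucket-name")
bucket.objects.all().delete()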
