EntityTooLarge error when uploading a 5G file to Amazon S3

Problem description

According to this announcement, the Amazon S3 file size limit is supposed to be 5 TB, but I am getting the following error when uploading a 5 GB file:

'/mahler%2Fparquet%2Fpageview%2Fall-2014-2000%2F_temporary%2F_attempt_201410112050_0009_r_000221_2222%2Fpart-r-222.parquet' XML Error Message: 
  <?xml version="1.0" encoding="UTF-8"?>
  <Error>
    <Code>EntityTooLarge</Code>
    <Message>Your proposed upload exceeds the maximum allowed size</Message>
    <ProposedSize>5374138340</ProposedSize>
    ...
    <MaxSizeAllowed>5368709120</MaxSizeAllowed>
  </Error>

This makes it seem like S3 only accepts uploads of up to 5 GB. I am using Apache Spark SQL to write out a Parquet data set via the SchemaRDD.saveAsParquetFile method (a rough sketch of the write follows the stack trace below). The full stack trace is

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/mahler%2Fparquet%2Fpageview%2Fall-2014-2000%2F_temporary%2F_attempt_201410112050_0009_r_000221_2222%2Fpart-r-222.parquet' XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>EntityTooLarge</Code><Message>Your proposed upload exceeds the maximum allowed size</Message><ProposedSize>5374138340</ProposedSize><RequestId>20A38B479FFED879</RequestId><HostId>KxeGsPreQ0hO7mm7DTcGLiN7vi7nqT3Z6p2Nbx1aLULSEzp6X5Iu8Kj6qM7Whm56ciJ7uDEeNn4=</HostId><MaxSizeAllowed>5368709120</MaxSizeAllowed></Error>
        org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:82)
        sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        java.lang.reflect.Method.invoke(Method.java:606)
        org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        org.apache.hadoop.fs.s3native.$Proxy10.storeFile(Unknown Source)
        org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.close(NativeS3FileSystem.java:174)
        org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
        org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
        parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:321)
        parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:111)
        parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
        org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:305)
        org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
        org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:745)
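
For context, the write that produces this error looks roughly like the sketch below. It is only a sketch: the app name, input path, and S3 destination are illustrative placeholders, not the actual job.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  // Minimal sketch of the kind of job that hits the error (Spark 1.1-era API).
  val sc = new SparkContext(new SparkConf().setAppName("pageviews-to-parquet"))
  val sqlContext = new SQLContext(sc)

  // someSchemaRDD stands in for whatever SchemaRDD is being written out.
  val someSchemaRDD = sqlContext.parquetFile("hdfs:///input/pageviews")

  // Each reduce task writes one part file through the s3n:// NativeS3FileSystem,
  // which uploads the whole part in a single PUT when the output stream is closed,
  // so any part file larger than 5 GB fails with EntityTooLarge.
  someSchemaRDD.saveAsParquetFile("s3n://mahler/parquet/pageview/all-2014-2000")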

Is the upload limit still 5 TB? If it is, why am I getting this error, and how do I fix it?

Recommended answer

The object size limit is 5 TB. The size of a single upload operation (one PUT) is still limited to 5 GB, as explained in the documentation:

Depending on the size of the data you are uploading, Amazon S3 offers the following options:

  • Upload objects in a single operation—With a single PUT operation you can upload objects up to 5 GB in size.

  • Upload objects in parts—Using the Multipart upload API you can upload large objects, up to 5 TB.

http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
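
To make the multipart option concrete, here is a minimal sketch using the AWS SDK for Java's TransferManager (shown in Scala), which splits a large file into parts and uploads them in parallel. The bucket name, key, and local file path are illustrative placeholders.

  import java.io.File
  import com.amazonaws.services.s3.transfer.TransferManager

  // TransferManager picks up credentials from the default provider chain and
  // switches to the multipart upload API automatically for large files.
  val tm = new TransferManager()
  val upload = tm.upload("my-bucket", "parquet/part-r-00222.parquet",
                         new File("/tmp/part-r-00222.parquet"))
  upload.waitForCompletion()  // blocks until all parts are uploaded and recombined
  tm.shutdownNow()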

Once you do a multipart upload, S3 validates and recombines the parts, and you then have a single object in S3, up to 5 TB in size, that can be downloaded as a single entity with a single HTTP GET request... but uploading is potentially much faster, even for files smaller than 5 GB, since you can upload the parts in parallel and even retry any parts that didn't succeed on the first attempt.
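
In the Spark/Hadoop setup from the question, the upload goes through the s3n:// connector rather than the SDK directly. Assuming the cluster runs a Hadoop build that includes multipart support for NativeS3FileSystem (HADOOP-9454, roughly Hadoop 2.4+), a sketch like the following would switch writes to multipart uploads; the property names and the 64 MB part size are assumptions tied to that feature, so verify them against the Hadoop version in use.

  // Hedged sketch: the fs.s3n.multipart.* properties exist only in Hadoop builds
  // that include HADOOP-9454 (multipart upload support for NativeS3FileSystem).
  sc.hadoopConfiguration.set("fs.s3n.multipart.uploads.enabled", "true")
  // Upload in 64 MB parts instead of one single PUT per file.
  sc.hadoopConfiguration.set("fs.s3n.multipart.uploads.block.size", (64L * 1024 * 1024).toString)

  // ...then write the Parquet data set exactly as before.
  someSchemaRDD.saveAsParquetFile("s3n://mahler/parquet/pageview/all-2014-2000")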
