Batching multiple files to Amazon S3 using the Java SDK

Problem description

I'm trying to upload multiple files to Amazon S3, all under the same key, by appending the files. I have a list of file names and want to upload/append the files in that order. I am following this tutorial almost exactly, except that I loop through each file first and upload each one in parts. Because the files are on HDFS (the Path is actually org.apache.hadoop.fs.Path), I use an input stream to send the file data. Some pseudocode is below (I am commenting the blocks that are word for word from the tutorial):

// Create a list of PartETag objects. You get one of these for
// each part upload.
List<PartETag> partETags = new ArrayList<PartETag>();

// Step 1: Initialize.
InitiateMultipartUploadRequest initRequest = new InitiateMultipartUploadRequest(
        bk.getBucket(), bk.getKey());
InitiateMultipartUploadResult initResponse =
        s3Client.initiateMultipartUpload(initRequest);
try {
      // Step 2: Upload parts.
      int i = 1; // part number; S3 part numbers must start at 1
      for (String file : files) {
        Path filePath = new Path(file);

        // Get the input stream and content length
        long contentLength = fss.get(branch).getFileStatus(filePath).getLen();
        InputStream is = fss.get(branch).open(filePath);

        long filePosition = 0;
        while (filePosition < contentLength) {
            // The last part can be smaller than 5 MB; adjust the part size as needed.
            long partSize = Math.min(5 * 1024 * 1024, contentLength - filePosition);

            // Create the request to upload a part (word for word from the tutorial,
            // except the data comes from the input stream instead of a file).
            UploadPartRequest uploadRequest = new UploadPartRequest()
                    .withBucketName(bk.getBucket()).withKey(bk.getKey())
                    .withUploadId(initResponse.getUploadId()).withPartNumber(i)
                    .withInputStream(is)
                    .withPartSize(partSize);

            // Upload the part and add the response's ETag to our list.
            partETags.add(s3Client.uploadPart(uploadRequest).getPartETag());

            filePosition += partSize;
            i++;
        }
        is.close(); // done with this file's stream
    }
    // Step 3: Complete.
    CompleteMultipartUploadRequest compRequest = new
          CompleteMultipartUploadRequest(bk.getBucket(),
          bk.getKey(),
          initResponse.getUploadId(),
          partETags);

    s3Client.completeMultipartUpload(compRequest);
} catch (Exception e) {
      // On failure, abort the multipart upload so uploaded parts are not left behind.
      s3Client.abortMultipartUpload(new AbortMultipartUploadRequest(
              bk.getBucket(), bk.getKey(), initResponse.getUploadId()));
}

However, I am getting the following error:

com.amazonaws.services.s3.model.AmazonS3Exception: The XML you provided was not well-formed or did not validate against our published schema (Service: Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID: 2C1126E838F65BB9), S3 Extended Request ID: QmpybmrqepaNtTVxWRM1g2w/fYW+8DPrDwUEK1XeorNKtnUKbnJeVM6qmeNcrPwc
    at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1109)
    at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:741)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:461)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:296)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3743)
    at com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:2617)

If anyone knows what the cause of this error might be, that would be greatly appreciated. Alternatively, if there is a better way to concatenate a bunch of files into one S3 key, that would be great as well. I tried using Java's built-in SequenceInputStream, but that did not work. Any help would be greatly appreciated. For reference, the total size of all the files could be as large as 10-15 GB.
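
For context, one common trigger for a MalformedXML error on completeMultipartUpload is sending an empty or otherwise invalid part list. As a hedged sanity check, reusing the bk, partETags, and s3Client names from the snippet above, something like this can rule that case out before completing:

// Assumption: bk, partETags and s3Client are the variables from the snippet above.
// An empty part list is a common trigger for S3's MalformedXML error, since the
// CompleteMultipartUpload request body then contains no <Part> elements.
if (partETags.isEmpty()) {
    throw new IllegalStateException(
            "No parts were uploaded for key " + bk.getKey() + "; refusing to complete");
}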

Answer

I know this is probably a bit late, but it's worth contributing. I managed to solve a similar problem using SequenceInputStream.

The trick is to calculate the total size of the result file up front and then feed the SequenceInputStream with an Enumeration&lt;InputStream&gt;.

Here's some example code that might help:

public void combineFiles() {
    List<String> files = getFiles();
    // Sum the sizes of the individual files so S3 knows the final object's
    // content length up front.
    long totalFileSize = files.stream()
                              .mapToLong(this::getContentLength)
                              .sum();

    // SequenceInputStream reads the underlying streams one after another,
    // presenting all of the files as a single concatenated stream.
    try (InputStream combinedFiles = new SequenceInputStream(getInputStreamEnumeration(files))) {
        ObjectMetadata resultFileMetadata = new ObjectMetadata();
        resultFileMetadata.setContentLength(totalFileSize);
        s3Client.putObject("bucketName", "resultFilePath", combinedFiles, resultFileMetadata);
    } catch (IOException e) {
        LOG.error("An error occurred while combining files.", e);
    }
}

// Lazily opens one InputStream per file, in list order, as the
// SequenceInputStream asks for the next element.
private Enumeration<? extends InputStream> getInputStreamEnumeration(List<String> files) {
    return new Enumeration<InputStream>() {
        private final Iterator<String> fileNamesIterator = files.iterator();

        @Override
        public boolean hasMoreElements() {
            return fileNamesIterator.hasNext();
        }

        @Override
        public InputStream nextElement() {
            try {
                return new FileInputStream(Paths.get(fileNamesIterator.next()).toFile());
            } catch (FileNotFoundException e) {
                // Surface the failure; SequenceInputStream has no way to recover here.
                System.err.println(e.getMessage());
                throw new RuntimeException(e);
            }
        }
    };
}
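
The getFiles and getContentLength helpers are not shown in the answer. A minimal sketch of what they might look like for local files follows; the paths are hypothetical placeholders, and on HDFS getContentLength would use the FileSystem API from the question instead:

// Hypothetical helpers assumed by combineFiles(); not part of the original answer.
private List<String> getFiles() {
    // The file paths to concatenate, in upload order (placeholder values).
    return Arrays.asList("/data/part-00000", "/data/part-00001");
}

private long getContentLength(String file) {
    // Size of a local file in bytes. On HDFS this would be something like
    // fileSystem.getFileStatus(new Path(file)).getLen(), as in the question.
    return Paths.get(file).toFile().length();
}

Setting the content length up front matters: without it, the SDK has to buffer the whole stream in memory to determine the object size. Note also that a single putObject call is capped at 5 GB, so for the 10-15 GB case in the question the combined stream would still have to go through a multipart upload, for example via TransferManager.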

Hope this helps!
