Google cloud: Using gsutil to download data from AWS S3 to GCS


Problem description

One of our collaborators has made some data available on AWS, and I am trying to get it into our Google Cloud Storage bucket using gsutil (only some of the files are of use to us, so I don't want to use the GUI provided on GCS). The collaborators have provided us with the AWS bucket ID, the AWS access key ID, and the AWS secret access key.

I looked through the documentation on GCE and edited the ~/.boto file so that the access keys are incorporated. I restarted my terminal and tried to do an 'ls' but got the following error:

gsutil ls s3://cccc-ffff-03210/
AccessDeniedException: 403 AccessDenied
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied

Do I need to configure/run something else too?

thanks!

EDITS:

Thanks for the replies!

I installed the Cloud SDK and I can access and run all gsutil commands on my Google Cloud Storage project. My problem is in trying to access (e.g. with the 'ls' command) the Amazon S3 bucket that is being shared with me.


  1. I uncommented two lines in the ~/.boto file and put the access keys:


    # To add HMAC aws credentials for "s3://" URIs, edit and uncomment the
    # following two lines:
    aws_access_key_id = my_access_key
    aws_secret_access_key = my_secret_access_key
    


  2. Output of 'gsutil version -l':


    | => gsutil version -l
    
    my_gc_id
    gsutil version: 4.27
    checksum: 5224e55e2df3a2d37eefde57 (OK)
    boto version: 2.47.0
    python version: 2.7.10 (default, Oct 23 2015, 19:19:21) [GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)]
    OS: Darwin 15.4.0
    multiprocessing available: True
    using cloud sdk: True
    pass cloud sdk credentials to gsutil: True
    config path(s): /Users/pc/.boto, /Users/pc/.config/gcloud/legacy_credentials/pc@gmail.com/.boto
    gsutil path: /Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gsutil
    compiled crcmod: True
    installed via package manager: False
    editable install: False
    


  3. The output with the -DD option is:


    => gsutil -DD ls s3://my_amazon_bucket_id
    
    multiprocessing available: True
    using cloud sdk: True
    pass cloud sdk credentials to gsutil: True
    config path(s): /Users/pc/.boto, /Users/pc/.config/gcloud/legacy_credentials/pc@gmail.com/.boto
    gsutil path: /Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gsutil
    compiled crcmod: True
    installed via package manager: False
    editable install: False
    Command being run: /Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=my_gc_id -DD ls s3://my_amazon_bucket_id
    config_file_list: ['/Users/pc/.boto', '/Users/pc/.config/gcloud/legacy_credentials/pc@gmail.com/.boto']
    config: [('debug', '0'), ('working_dir', '/mnt/pyami'), ('https_validate_certificates', 'True'), ('debug', '0'), ('working_dir', '/mnt/pyami'), ('content_language', 'en'), ('default_api_version', '2'), ('default_project_id', 'my_gc_id')]
    DEBUG 1103 08:42:34.664643 provider.py] Using access key found in shared credential file.
    DEBUG 1103 08:42:34.664919 provider.py] Using secret key found in shared credential file.
    DEBUG 1103 08:42:34.665841 connection.py] path=/
    DEBUG 1103 08:42:34.665967 connection.py] auth_path=/my_amazon_bucket_id/
    DEBUG 1103 08:42:34.666115 connection.py] path=/?delimiter=/
    DEBUG 1103 08:42:34.666200 connection.py] auth_path=/my_amazon_bucket_id/?delimiter=/
    DEBUG 1103 08:42:34.666504 connection.py] Method: GET
    DEBUG 1103 08:42:34.666589 connection.py] Path: /?delimiter=/
    DEBUG 1103 08:42:34.666668 connection.py] Data: 
    DEBUG 1103 08:42:34.666724 connection.py] Headers: {}
    DEBUG 1103 08:42:34.666776 connection.py] Host: my_amazon_bucket_id.s3.amazonaws.com
    DEBUG 1103 08:42:34.666831 connection.py] Port: 443
    DEBUG 1103 08:42:34.666882 connection.py] Params: {}
    DEBUG 1103 08:42:34.666975 connection.py] establishing HTTPS connection: host=my_amazon_bucket_id.s3.amazonaws.com, kwargs={'port': 443, 'timeout': 70}
    DEBUG 1103 08:42:34.667128 connection.py] Token: None
    DEBUG 1103 08:42:34.667476 auth.py] StringToSign:
    GET
    
    
    Fri, 03 Nov 2017 12:42:34 GMT
    /my_amazon_bucket_id/
    DEBUG 1103 08:42:34.667600 auth.py] Signature:
    AWS RN8=
    DEBUG 1103 08:42:34.667705 connection.py] Final headers: {'Date': 'Fri, 03 Nov 2017 12:42:34 GMT', 'Content-Length': '0', 'Authorization': u'AWS AK6GJQ:EFVB8F7rtGN8=', 'User-Agent': 'Boto/2.47.0 Python/2.7.10 Darwin/15.4.0 gsutil/4.27 (darwin) google-cloud-sdk/164.0.0'}
    DEBUG 1103 08:42:35.179369 https_connection.py] wrapping ssl socket; CA certificate file=/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/third_party/boto/boto/cacerts/cacerts.txt
    DEBUG 1103 08:42:35.247599 https_connection.py] validating server certificate: hostname=my_amazon_bucket_id.s3.amazonaws.com, certificate hosts=['*.s3.amazonaws.com', 's3.amazonaws.com']
    send: u'GET /?delimiter=/ HTTP/1.1
    Host: my_amazon_bucket_id.s3.amazonaws.com
    Accept-Encoding: identity
    Date: Fri, 03 Nov 2017 12:42:34 GMT
    Content-Length: 0
    Authorization: AWS AN8=
    User-Agent: Boto/2.47.0 Python/2.7.10 Darwin/15.4.0 gsutil/4.27 (darwin) google-cloud-sdk/164.0.0
    
    '
    reply: 'HTTP/1.1 403 Forbidden
    '
    header: x-amz-bucket-region: us-east-1
    header: x-amz-request-id: 60A164AAB3971508
    header: x-amz-id-2: +iPxKzrW8MiqDkWZ0E=
    header: Content-Type: application/xml
    header: Transfer-Encoding: chunked
    header: Date: Fri, 03 Nov 2017 12:42:34 GMT
    header: Server: AmazonS3
    DEBUG 1103 08:42:35.326652 connection.py] Response headers: [('date', 'Fri, 03 Nov 2017 12:42:34 GMT'), ('x-amz-id-2', '+iPxKz1dPdgDxpnWZ0E='), ('server', 'AmazonS3'), ('transfer-encoding', 'chunked'), ('x-amz-request-id', '60A164AAB3971508'), ('x-amz-bucket-region', 'us-east-1'), ('content-type', 'application/xml')]
    DEBUG 1103 08:42:35.327029 bucket.py] <?xml version="1.0" encoding="UTF-8"?>
    <Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>6097164508</RequestId><HostId>+iPxKzrWWZ0E=</HostId></Error>
    DEBUG: Exception stack trace:
    Traceback (most recent call last):
      File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 577, in _RunNamedCommandAndHandleExceptions
        collect_analytics=True)
      File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/command_runner.py", line 317, in RunNamedCommand
        return_code = command_inst.RunCommand()
      File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/commands/ls.py", line 548, in RunCommand
        exp_dirs, exp_objs, exp_bytes = ls_helper.ExpandUrlAndPrint(storage_url)
      File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/ls_helper.py", line 180, in ExpandUrlAndPrint
        print_initial_newline=False)
      File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/ls_helper.py", line 252, in _RecurseExpandUrlAndPrint
        bucket_listing_fields=self.bucket_listing_fields):
      File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/wildcard_iterator.py", line 476, in IterAll
        expand_top_level_buckets=expand_top_level_buckets):
      File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/wildcard_iterator.py", line 157, in __iter__
        fields=bucket_listing_fields):
      File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/boto_translation.py", line 413, in ListObjects
        self._TranslateExceptionAndRaise(e, bucket_name=bucket_name)
      File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/boto_translation.py", line 1471, in _TranslateExceptionAndRaise
        raise translated_exception
    AccessDeniedException: AccessDeniedException: 403 AccessDenied
    
    
    AccessDeniedException: 403 AccessDenied
    

Solution

I'll assume that you are able to set up gcloud credentials using gcloud init and gcloud auth login or gcloud auth activate-service-account, and can list/write objects to GCS successfully.

From there, you need two things: a properly configured AWS IAM policy applied to the AWS user you're using, and a properly configured ~/.boto file.

AWS S3 IAM policy for bucket access

A policy like this must be applied, either by a role granted to your user or an inline policy attached to the user.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::some-s3-bucket/*",
                "arn:aws:s3:::some-s3-bucket"
            ]
        }
    ]
}

The important part is that you have ListBucket and GetObject actions, and the resource scope for these includes at least the bucket (or prefix thereof) that you wish to read from.
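
As a rough illustration of the policy above, here is a minimal Python sketch that generates the same document for an arbitrary bucket name. The function name build_read_only_policy is hypothetical and not part of any AWS library; it only assembles the JSON, and attaching the policy to a user still has to be done on the AWS side.

```python
import json


def build_read_only_policy(bucket_name):
    """Return an IAM policy document allowing list and read access to one bucket.

    Hypothetical helper for illustration; it mirrors the policy shown above.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    # Object-level ARN: the target of s3:GetObject.
                    f"arn:aws:s3:::{bucket_name}/*",
                    # Bucket-level ARN: the target of s3:ListBucket.
                    f"arn:aws:s3:::{bucket_name}",
                ],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_read_only_policy("some-s3-bucket"), indent=4))
```

Both resource entries are needed because ListBucket acts on the bucket itself while GetObject acts on the objects inside it.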

.boto file configuration

Interoperation between service providers is always a bit tricky. At the time of this writing, in order to support AWS Signature V4 (the only one supported universally by all AWS regions), you have to add a couple of extra properties to your ~/.boto file beyond just the credentials, in an [s3] group.

[Credentials]
aws_access_key_id = [YOUR AKID]
aws_secret_access_key = [YOUR SECRET AK]
[s3]
use-sigv4=True
host=s3.us-east-2.amazonaws.com

The use-sigv4 property cues Boto, via gsutil, to use AWS Signature V4 for requests. Currently, this requires the host be specified in the configuration, unfortunately. It is pretty easy to figure the host name out, as it follows the pattern of s3.[BUCKET REGION].amazonaws.com.
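
The host name pattern can be sketched in a few lines of Python. This is only an illustration under the assumption that the bucket region is already known (in the question's debug output, the 403 reply carries an x-amz-bucket-region header with that value). The helper names s3_host_for_region and render_boto_s3_section are hypothetical.

```python
import configparser
import io


def s3_host_for_region(region):
    """Build the regional S3 host name following s3.[BUCKET REGION].amazonaws.com."""
    return f"s3.{region}.amazonaws.com"


def render_boto_s3_section(region):
    """Render the [s3] group of a ~/.boto file as text for the given region."""
    parser = configparser.ConfigParser()
    parser["s3"] = {
        "use-sigv4": "True",
        "host": s3_host_for_region(region),
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_boto_s3_section("us-east-2"))
```

The rendered text is meant to be pasted below the [Credentials] group; the sketch deliberately does not touch the credentials themselves.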

If you have rsync/cp work from multiple S3 regions, you could handle it in a few ways. You can set an environment variable like BOTO_CONFIG before running the command to switch between multiple config files. Or, you can override the setting on each run using a top-level argument, like:

gsutil -o s3:host=s3.us-east-2.amazonaws.com ls s3://some-s3-bucket
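
Both approaches can be scripted when several regions are involved. The sketch below is a minimal Python illustration that only builds the argument list and the environment; it does not run gsutil. The helper names and the config file path are hypothetical.

```python
import os


def gsutil_ls_command(bucket, region):
    """Build a gsutil 'ls' command that overrides s3:host for a single run."""
    return [
        "gsutil",
        "-o", f"s3:host=s3.{region}.amazonaws.com",
        "ls", f"s3://{bucket}",
    ]


def env_with_boto_config(config_path, base_env=None):
    """Return a copy of the environment with BOTO_CONFIG pointing at config_path."""
    env = dict(os.environ if base_env is None else base_env)
    env["BOTO_CONFIG"] = config_path
    return env


if __name__ == "__main__":
    # Hypothetical path to a per-region config file.
    env = env_with_boto_config("/path/to/boto-us-east-2", base_env={})
    print(gsutil_ls_command("some-s3-bucket", "us-east-2"))
    print(env["BOTO_CONFIG"])
```

To actually execute the command, the list and environment could be passed to subprocess.run(cmd, env=env, check=True), assuming gsutil is on the PATH.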

Edit:

Just want to add... another cool way to do this job is rclone.
