Google Cloud: Using gsutil to download data from AWS S3 to GCS
Problem description
One of our collaborators has made some data available on AWS, and I was trying to get it into our Google Cloud bucket using gsutil (only some of the files are of use to us, so I don't want to use the GUI provided on GCS). The collaborators have provided us with the AWS bucket ID, the AWS access key ID, and the AWS secret access key.
I looked through the documentation on GCE and edited the ~/.boto file to include the access keys. I restarted my terminal and tried an 'ls', but got the following error:
gsutil ls s3://cccc-ffff-03210/
AccessDeniedException: 403 AccessDenied
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied
Do I need to configure/run something else too?
Thanks!
EDITS:
Thanks for the replies!
I installed the Cloud SDK and I can access and run all gsutil commands on my Google Cloud Storage project. My problem is in trying to access (e.g. with the 'ls' command) the Amazon S3 bucket that is being shared with me.
I uncommented two lines in the ~/.boto file and added the access keys:
# To add HMAC aws credentials for "s3://" URIs, edit and uncomment the
# following two lines:
aws_access_key_id = my_access_key
aws_secret_access_key = my_secret_access_key
Output of 'gsutil version -l':
| => gsutil version -l my_gc_id
gsutil version: 4.27
checksum: 5224e55e2df3a2d37eefde57 (OK)
boto version: 2.47.0
python version: 2.7.10 (default, Oct 23 2015, 19:19:21) [GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)]
OS: Darwin 15.4.0
multiprocessing available: True
using cloud sdk: True
pass cloud sdk credentials to gsutil: True
config path(s): /Users/pc/.boto, /Users/pc/.config/gcloud/legacy_credentials/pc@gmail.com/.boto
gsutil path: /Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gsutil
compiled crcmod: True
installed via package manager: False
editable install: False
The output with the -DD option is:
=> gsutil -DD ls s3://my_amazon_bucket_id
multiprocessing available: True
using cloud sdk: True
pass cloud sdk credentials to gsutil: True
config path(s): /Users/pc/.boto, /Users/pc/.config/gcloud/legacy_credentials/pc@gmail.com/.boto
gsutil path: /Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gsutil
compiled crcmod: True
installed via package manager: False
editable install: False
Command being run: /Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=my_gc_id -DD ls s3://my_amazon_bucket_id
config_file_list: ['/Users/pc/.boto', '/Users/pc/.config/gcloud/legacy_credentials/pc@gmail.com/.boto']
config: [('debug', '0'), ('working_dir', '/mnt/pyami'), ('https_validate_certificates', 'True'), ('debug', '0'), ('working_dir', '/mnt/pyami'), ('content_language', 'en'), ('default_api_version', '2'), ('default_project_id', 'my_gc_id')]
DEBUG 1103 08:42:34.664643 provider.py] Using access key found in shared credential file.
DEBUG 1103 08:42:34.664919 provider.py] Using secret key found in shared credential file.
DEBUG 1103 08:42:34.665841 connection.py] path=/
DEBUG 1103 08:42:34.665967 connection.py] auth_path=/my_amazon_bucket_id/
DEBUG 1103 08:42:34.666115 connection.py] path=/?delimiter=/
DEBUG 1103 08:42:34.666200 connection.py] auth_path=/my_amazon_bucket_id/?delimiter=/
DEBUG 1103 08:42:34.666504 connection.py] Method: GET
DEBUG 1103 08:42:34.666589 connection.py] Path: /?delimiter=/
DEBUG 1103 08:42:34.666668 connection.py] Data:
DEBUG 1103 08:42:34.666724 connection.py] Headers: {}
DEBUG 1103 08:42:34.666776 connection.py] Host: my_amazon_bucket_id.s3.amazonaws.com
DEBUG 1103 08:42:34.666831 connection.py] Port: 443
DEBUG 1103 08:42:34.666882 connection.py] Params: {}
DEBUG 1103 08:42:34.666975 connection.py] establishing HTTPS connection: host=my_amazon_bucket_id.s3.amazonaws.com, kwargs={'port': 443, 'timeout': 70}
DEBUG 1103 08:42:34.667128 connection.py] Token: None
DEBUG 1103 08:42:34.667476 auth.py] StringToSign: GET Fri, 03 Nov 2017 12:42:34 GMT /my_amazon_bucket_id/
DEBUG 1103 08:42:34.667600 auth.py] Signature: AWS RN8=
DEBUG 1103 08:42:34.667705 connection.py] Final headers: {'Date': 'Fri, 03 Nov 2017 12:42:34 GMT', 'Content-Length': '0', 'Authorization': u'AWS AK6GJQ:EFVB8F7rtGN8=', 'User-Agent': 'Boto/2.47.0 Python/2.7.10 Darwin/15.4.0 gsutil/4.27 (darwin) google-cloud-sdk/164.0.0'}
DEBUG 1103 08:42:35.179369 https_connection.py] wrapping ssl socket; CA certificate file=/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/third_party/boto/boto/cacerts/cacerts.txt
DEBUG 1103 08:42:35.247599 https_connection.py] validating server certificate: hostname=my_amazon_bucket_id.s3.amazonaws.com, certificate hosts=['*.s3.amazonaws.com', 's3.amazonaws.com']
send: u'GET /?delimiter=/ HTTP/1.1 Host: my_amazon_bucket_id.s3.amazonaws.com Accept-Encoding: identity Date: Fri, 03 Nov 2017 12:42:34 GMT Content-Length: 0 Authorization: AWS AN8= User-Agent: Boto/2.47.0 Python/2.7.10 Darwin/15.4.0 gsutil/4.27 (darwin) google-cloud-sdk/164.0.0 '
reply: 'HTTP/1.1 403 Forbidden '
header: x-amz-bucket-region: us-east-1
header: x-amz-request-id: 60A164AAB3971508
header: x-amz-id-2: +iPxKzrW8MiqDkWZ0E=
header: Content-Type: application/xml
header: Transfer-Encoding: chunked
header: Date: Fri, 03 Nov 2017 12:42:34 GMT
header: Server: AmazonS3
DEBUG 1103 08:42:35.326652 connection.py] Response headers: [('date', 'Fri, 03 Nov 2017 12:42:34 GMT'), ('x-amz-id-2', '+iPxKz1dPdgDxpnWZ0E='), ('server', 'AmazonS3'), ('transfer-encoding', 'chunked'), ('x-amz-request-id', '60A164AAB3971508'), ('x-amz-bucket-region', 'us-east-1'), ('content-type', 'application/xml')]
DEBUG 1103 08:42:35.327029 bucket.py] <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>6097164508</RequestId><HostId>+iPxKzrWWZ0E=</HostId></Error>
DEBUG: Exception stack trace:
Traceback (most recent call last):
  File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 577, in _RunNamedCommandAndHandleExceptions
    collect_analytics=True)
  File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/command_runner.py", line 317, in RunNamedCommand
    return_code = command_inst.RunCommand()
  File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/commands/ls.py", line 548, in RunCommand
    exp_dirs, exp_objs, exp_bytes = ls_helper.ExpandUrlAndPrint(storage_url)
  File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/ls_helper.py", line 180, in ExpandUrlAndPrint
    print_initial_newline=False)
  File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/ls_helper.py", line 252, in _RecurseExpandUrlAndPrint
    bucket_listing_fields=self.bucket_listing_fields):
  File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/wildcard_iterator.py", line 476, in IterAll
    expand_top_level_buckets=expand_top_level_buckets):
  File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/wildcard_iterator.py", line 157, in __iter__
    fields=bucket_listing_fields):
  File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/boto_translation.py", line 413, in ListObjects
    self._TranslateExceptionAndRaise(e, bucket_name=bucket_name)
  File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/boto_translation.py", line 1471, in _TranslateExceptionAndRaise
    raise translated_exception
AccessDeniedException: AccessDeniedException: 403 AccessDenied
AccessDeniedException: 403 AccessDenied
I'll assume that you are able to set up gcloud credentials using gcloud init and gcloud auth login or gcloud auth activate-service-account, and can list/write objects to GCS successfully.
From there, you need two things: a properly configured AWS IAM policy applied to the AWS user you're using, and a properly configured ~/.boto file.
AWS S3 IAM policy for bucket access
A policy like this must be applied, either by a role granted to your user or an inline policy attached to the user.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::some-s3-bucket/*",
        "arn:aws:s3:::some-s3-bucket"
      ]
    }
  ]
}
The important part is that you have the ListBucket and GetObject actions, and that the resource scope for these includes at least the bucket (or prefix thereof) that you wish to read from.
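As a rough illustration, the same policy document can be generated for any bucket name. This is only a sketch: the helper name build_read_only_policy is made up for this example, and the output still has to be attached to the AWS user by whoever administers the S3 bucket.

```python
import json


def build_read_only_policy(bucket: str) -> dict:
    """Return an IAM policy document granting list/read access to one bucket.

    Hypothetical helper for illustration; it mirrors the JSON policy above.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    # Object-level ARN: s3:GetObject applies to objects.
                    f"arn:aws:s3:::{bucket}/*",
                    # Bucket-level ARN: s3:ListBucket applies to the bucket.
                    f"arn:aws:s3:::{bucket}",
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Print the policy as JSON so it can be pasted into the IAM console.
    print(json.dumps(build_read_only_policy("some-s3-bucket"), indent=2))
```

Both resource entries are needed because listing is checked against the bucket itself, while reads are checked against the objects inside it.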
.boto file configuration
Interoperation between service providers is always a bit tricky. At the time of this writing, in order to support AWS Signature V4 (the only one supported universally by all AWS regions), you have to add a couple of extra properties to your ~/.boto file beyond just the credentials, in an [s3] group.
[Credentials]
aws_access_key_id = [YOUR AKID]
aws_secret_access_key = [YOUR SECRET AK]
[s3]
use-sigv4=True
host=s3.us-east-2.amazonaws.com
The use-sigv4 property cues Boto, via gsutil, to use AWS Signature V4 for requests. Unfortunately, this currently requires the host to be specified in the configuration. The host name is easy to work out, as it follows the pattern s3.[BUCKET REGION].amazonaws.com.
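If you need such a file for more than one region, the configuration can be rendered from a small script. The sketch below is an assumption-laden illustration, not part of gsutil or Boto: the function names are invented, and it only returns the text of the file rather than writing it, so an existing ~/.boto is never touched.

```python
import configparser
import io


def s3_host_for_region(region: str) -> str:
    """Build the endpoint following the s3.[BUCKET REGION].amazonaws.com pattern."""
    return f"s3.{region}.amazonaws.com"


def render_boto_config(access_key_id: str, secret_access_key: str, region: str) -> str:
    """Render a .boto-style file with [Credentials] and [s3] sections.

    Hypothetical helper: the caller decides where (and whether) to save the text.
    """
    config = configparser.ConfigParser()
    config["Credentials"] = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }
    config["s3"] = {
        "use-sigv4": "True",
        "host": s3_host_for_region(region),
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder credentials, matching the placeholders used in the answer.
    print(render_boto_config("[YOUR AKID]", "[YOUR SECRET AK]", "us-east-2"))
```

The bucket's region can be read from the x-amz-bucket-region response header, which shows up in gsutil -DD output (us-east-1 in the question's log).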
If you have rsync/cp work across multiple S3 regions, you could handle it a few ways. You can set an environment variable like BOTO_CONFIG before running the command to switch between multiple config files. Or, you can override the setting on each run using a top-level argument, like:
gsutil -o s3:host=s3.us-east-2.amazonaws.com ls s3://some-s3-bucket
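Both approaches can be wrapped in a small script when the same job has to run against several regions. This is a minimal sketch under stated assumptions: gsutil is on the PATH, the helper names are hypothetical, and the bucket names and config path in the comments are placeholders.

```python
import os
import subprocess


def gsutil_command(region: str, *args: str) -> list:
    """Build a gsutil argv with a per-run s3:host override for the given region."""
    return ["gsutil", "-o", f"s3:host=s3.{region}.amazonaws.com", *args]


def run_with_boto_config(boto_config_path: str, *args: str) -> int:
    """Run gsutil with BOTO_CONFIG pointing at an alternate config file."""
    env = dict(os.environ, BOTO_CONFIG=boto_config_path)
    return subprocess.run(["gsutil", *args], env=env).returncode


if __name__ == "__main__":
    # Only print the command here; pass it to subprocess.run(...) to execute it.
    print(" ".join(gsutil_command("us-east-2", "ls", "s3://some-s3-bucket")))
    # Example with a separate config file (placeholder path):
    # run_with_boto_config("/path/to/us-east-2.boto", "ls", "s3://some-s3-bucket")
```

Building the argument list separately from running it keeps the override easy to inspect before anything is copied.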
Edit:
Just want to add... another cool way to do this job is rclone.