Reading S3 data from Google's dataproc

Problem description

I'm running a pyspark application through Google's dataproc on a cluster I created. In one stage, the application needs to access a directory in an Amazon S3 bucket. At that stage, I get the error:

AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).

I logged onto the headnode of the cluster and set the /etc/boto.cfg with my AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY information, but that didn't solve the access issue.

(1) Any other suggestions for how to access AWS S3 from a dataproc cluster?

(2) Also, what is the name of the user that dataproc uses to access the cluster? If I knew that, I could set the ~/.aws directory on the cluster for that user.

Thanks.

Recommended answer

Since you're using the Hadoop/Spark interfaces (like sc.textFile), everything should indeed be done through the fs.s3.* or fs.s3n.* or fs.s3a.* keys rather than trying to wire through any ~/.aws or /etc/boto.cfg settings. There are a few ways you can plumb those settings through to your Dataproc cluster:

At cluster creation time:

gcloud dataproc clusters create --properties \
    core:fs.s3.awsAccessKeyId=<s3AccessKey>,core:fs.s3.awsSecretAccessKey=<s3SecretKey> \
    --num-workers ...

The core prefix here indicates you want the settings to be placed in the core-site.xml file, as explained in the Cluster Properties documentation.
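
For reference, the entries that end up in /etc/hadoop/conf/core-site.xml on the cluster nodes would look roughly like the following sketch (property names are from the command above; the values shown are placeholders for your own keys):

<!-- Sketch of the resulting core-site.xml entries (inside the <configuration> root); values are placeholders -->
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_S3_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_S3_SECRET_KEY</value>
</property>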

Alternatively, at job-submission time, if you're using Dataproc's APIs:

gcloud dataproc jobs submit pyspark --cluster <your-cluster> \
    --properties spark.hadoop.fs.s3.awsAccessKeyId=<s3AccessKey>,spark.hadoop.fs.s3.awsSecretAccessKey=<s3SecretKey> \
    ...

In this case, we're passing the properties through as Spark properties, and Spark provides a handy mechanism to define "hadoop" conf properties as a subset of Spark conf, simply using the spark.hadoop.* prefix. If you're submitting at the command line over SSH, this is equivalent to:

spark-submit --conf spark.hadoop.fs.s3.awsAccessKeyId=<s3AccessKey> \
    --conf spark.hadoop.fs.s3.awsSecretAccessKey=<s3SecretKey>
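
Once the keys are in place through any of these routes, the pyspark job can read the S3 path directly with the usual Hadoop-style APIs. A minimal sketch (the bucket and path below are hypothetical placeholders):

# Minimal sketch: assumes fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey were
# already supplied via cluster properties, job properties, or spark-submit --conf.
from pyspark import SparkContext

sc = SparkContext()
# Read from a placeholder S3 location; replace with your actual bucket/path.
lines = sc.textFile("s3://your-bucket/path/to/data/")
print(lines.count())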

Finally, if you want to set it up at cluster creation time but prefer not to have your access keys explicitly set in your Dataproc metadata, you might opt to use an initialization action instead. There's a handy tool called bdconfig that should be present on the path with which you can modify XML settings easily:

#!/bin/bash
# Create this shell script, name it something like init-aws.sh
bdconfig set_property \
    --configuration_file /etc/hadoop/conf/core-site.xml \
    --name 'fs.s3.awsAccessKeyId' \
    --value '<s3AccessKey>' \
    --clobber
bdconfig set_property \
    --configuration_file /etc/hadoop/conf/core-site.xml \
    --name 'fs.s3.awsSecretAccessKey' \
    --value '<s3SecretKey>' \
    --clobber

Upload that to a GCS bucket somewhere, and use it at cluster creation time:

gsutil cp init-aws.sh gs://<your-bucket>/init-aws.sh
gcloud dataproc clusters create --initialization-actions \
    gs://<your-bucket>/init-aws.sh

While Dataproc metadata is indeed encrypted at rest and heavily secured just like any other user data, using the init action instead helps prevent inadvertently showing your access key/secret to, for example, someone standing behind your screen when viewing your Dataproc cluster properties.
