AWS Athena: cross account write of CTAS query result


Problem description

I have a big historical dataset in account A. The dataset is in csv format and partitioned by year/month/day/hour/. My goal is to convert this data to parquet, with additional normalisation steps and an extra level of partitioning, e.g. year/month/day/hour/product/, and write it back to the same bucket of account A under the processed/ "directory". So the "directory" tree would look like

S3_bucket_Account_A

dataset
├── raw
│   ├── year=2017
│   │   ├── month=01
│   │   │   ├── day=01
│   │   │   │   ├── hour=00
│   │   │   │   └── hour=01
│
├── processed
│   ├── year=2017
│   │   ├── month=01
│   │   │   ├── day=01
│   │   │   │   ├── hour=00
│   │   │   │   │   ├── product=A
│   │   │   │   │   └── product=B
│   │   │   │   ├── hour=01
│   │   │   │   │   ├── product=A
│   │   │   │   │   └── product=B

In order to do that, I am sending CTAS query statements to Athena with the boto3 API. I am aware of the limitations of CTAS queries, e.g. a single query can write at most 100 partitions, and the location of the CTAS query result must be empty/unique. So I process one raw partition at a time, and the content of each CTAS query is generated on the fly with those limitations in mind.
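To make the per-partition approach concrete, here is a minimal sketch of how such a statement could be generated. The table names, raw_table, and the suffixing scheme are assumptions of mine; the placeholders mirror the ones used in the CTAS "header" shown further down.

```python
def build_ctas_statement(year, month, day, hour):
    """Build a CTAS statement converting one raw hour partition to parquet.

    Processing one (year, month, day, hour) per query means each query
    creates only product-level partitions, which keeps it under Athena's
    100-partitions-per-CTAS limit.
    """
    # Each CTAS needs a fresh table name and an empty output location.
    table = f"ctas_table_{year}_{month}_{day}_{hour}"
    location = (
        "s3://__destination_bucket_in_Account_A__/__CTAS_prefix__/"
        f"{year}/{month}/{day}/{hour}/"
    )
    return (
        f"CREATE TABLE some_database_in_account_B.{table}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"
        f"  external_location = '{location}',\n"
        "  partitioned_by = ARRAY['product']\n"
        ") AS\n"
        "SELECT *\n"
        "FROM some_database_in_account_B.raw_table\n"
        f"WHERE year = '{year}' AND month = '{month}' "
        f"AND day = '{day}' AND hour = '{hour}'"
    )
```

The resulting string can then be passed to Athena via boto3's start_query_execution.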

I am using account B to execute these CTAS queries, but the results of these queries should be written into an S3 bucket owned by account A. I have been given the following permissions, specified at the bucket policy level of account A.

{
    "Effect": "Allow",
    "Principal": {
        "AWS": "__ARN_OF_ACCOUNT_B__"
    },
    "Action": [
        "s3:*"
    ],
    "Resource": [
        "arn:aws:s3:::dataset",
        "arn:aws:s3:::dataset/*"
    ]
}

The problem is that account A (the bucket owner) doesn't have access to the files written as a result of CTAS queries executed by account B's Athena.

As I understand it, there is an option of account A creating an IAM role for me, and then I would perform this task as if I were account A. But unfortunately, this option is out of the question.

I have found ways to transfer ownership / change the ACL of S3 objects. One way would be to output the CTAS query result into an S3 bucket of account B and then copy these files to the bucket of account A (original source):

aws s3 cp s3://source_awsexamplebucket/ s3://destination_awsexamplebucket/ --acl bucket-owner-full-control --recursive

Another way is to recursively update the ACL with something like (original source):

aws s3 ls s3://bucket/path/ --recursive | awk '{cmd="aws s3api put-object-acl --acl bucket-owner-full-control --bucket bucket --key "$4; system(cmd)}'

But these two options would require additional GET and PUT requests to S3, and thus more money paid to AWS. More importantly, after a CTAS query succeeds I update the AWS Glue table (destination table) of account A with the partitions from the created table. That way, IAM users in account A can start querying the transformed data straight away. Here is the general idea of how I update destination_table:

response = glue_client.get_partitions(
    CatalogId="__ACCOUNT_B_ID__",
    DatabaseName="some_database_in_account_B",
    TableName="ctas_table"
)

# drop fields that are not part of the PartitionInput structure
for partition in response["Partitions"]:
    for key in ["DatabaseName", "TableName", "CreationTime"]:
        partition.pop(key)
        
glue_client.batch_create_partition(
    CatalogId="__ACCOUNT_A_ID__",
    DatabaseName="some_database_in_account_A",
    TableName="destination_table",
    PartitionInputList=response["Partitions"]
)

I do it this way instead of MSCK REPAIR TABLE destination_table because the latter takes a long time for some reason. As you can see, if I opt for the use of aws s3 cp, I would also need to take that into account when I copy the meta information about partitions.
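If the aws s3 cp route were taken, the Location of each partition fetched from the CTAS table would still point at the staging prefix and would need rewriting before the batch_create_partition call. A sketch of that rewrite, with hypothetical names:

```python
def relocate_partitions(partitions, src_prefix, dst_prefix):
    """Return copies of Glue partition dicts with StorageDescriptor.Location
    rewritten from the CTAS staging prefix to the prefix the files were
    copied to, ready to pass to batch_create_partition."""
    relocated = []
    for partition in partitions:
        partition = dict(partition)            # don't mutate the API response
        sd = dict(partition["StorageDescriptor"])
        sd["Location"] = sd["Location"].replace(src_prefix, dst_prefix, 1)
        partition["StorageDescriptor"] = sd
        relocated.append(partition)
    return relocated
```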

So my real question is: how can I grant the owner of the bucket full control over the results of CTAS queries executed by another account?

Just found a similar post, but it seems that they use an IAM role, which is not an option in my case.

I found out that: 1) It is not possible to change the ACL within a CTAS query. Instead, an S3 object can be copied onto itself (thanks to the comments from John Rotenstein and Theo) with new ownership.

Just to recap: I run the CTAS query from account B, but the result is saved in a bucket owned by account A. This is what the CTAS query "header" looks like:

CREATE TABLE some_database_in_account_B.ctas_table
WITH (
  format = 'PARQUET',
  external_location = 's3://__destination_bucket_in_Account_A__/__CTAS_prefix__/',
  partitioned_by = ARRAY['year', 'month', 'day', 'hour', 'product']
) AS (
    ...
    ...
)

Since I use boto3 to submit CTAS queries and I know __destination_bucket_in_Account_A__ together with __CTAS_prefix__, then instead of copying files onto themselves with aws cp, I can directly change their ACL within the same python script upon successful execution of the CTAS query.

# aws_session is an authenticated boto3.Session (account B credentials)
s3_resource = aws_session.resource('s3')
destination_bucket = s3_resource.Bucket(name="__destination_bucket_in_Account_A__")

for obj in destination_bucket.objects.filter(Prefix="__CTAS_prefix__"):
    object_acl = s3_resource.ObjectAcl(destination_bucket.name, obj.key)
    object_acl.put(
        ACL='bucket-owner-full-control'
    )

Note: since I need to submit a number of CTAS queries that exceeds the limitations of AWS Athena, I have already implemented logic that automatically submits new queries and performs some additional steps, e.g. updating the destination Glue table and logging. Therefore, including these lines of code is quite straightforward.
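That submission logic isn't shown here, but the waiting step it needs can be sketched as follows, assuming a boto3 Athena client is passed in (wait_for_query is a hypothetical helper; get_query_execution is the actual Athena API call):

```python
import time

def wait_for_query(athena_client, query_execution_id, poll_seconds=2.0):
    """Poll Athena until the query reaches a terminal state and return it.
    Only on SUCCEEDED would the ACL fix-up and Glue update be run."""
    while True:
        response = athena_client.get_query_execution(
            QueryExecutionId=query_execution_id
        )
        state = response["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_seconds)
```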

Answer

Currently, the only way to do this cleanly is to use an IAM role in account A with a trust policy that allows account B to assume the role. You mention that this is not possible for your case, which is unfortunate. The reason why it's currently not possible any other way is that Athena will not write files with the "bucket-owner-full-control" option, so account A will never fully own any files created by an action initiated by a role in account B.
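For reference, the trust policy that the role approach would require on account A's side might look like the following, expressed as a boto3-ready dict (the ARN placeholder mirrors the bucket policy above):

```python
# Hypothetical trust policy on the role in account A that principals in
# account B would assume via sts:AssumeRole.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "__ARN_OF_ACCOUNT_B__"},
            "Action": "sts:AssumeRole",
        }
    ],
}
```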

Since the policy you have been granted in the destination bucket permits everything, one thing you can do is run a task after the CTAS operation finishes that lists the objects created and copies each one to itself (same source and destination keys) with the "bucket-owner-full-control" ACL option. Copying an object like this is a common way to change storage and ACL properties of S3 objects. This will, as you say, incur additional charges, but they will be minuscule in comparison to the CTAS charges, and charges related to future queries against the data.
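That post-CTAS task could be sketched like this with boto3 (the client is passed in; note that S3 rejects copying an object onto itself unless something besides the ACL changes, hence MetadataDirective="REPLACE", which discards any user metadata — an assumption worth verifying for your objects):

```python
def copy_in_place_with_owner_acl(s3_client, bucket, prefix):
    """Copy each object under prefix onto itself, granting the bucket owner
    full control via the ACL set on the copy request."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            s3_client.copy_object(
                Bucket=bucket,
                Key=obj["Key"],
                CopySource={"Bucket": bucket, "Key": obj["Key"]},
                ACL="bucket-owner-full-control",
                MetadataDirective="REPLACE",
            )
```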

The real downside is having to write something to run after the CTAS operation, and coordinating that. I suggest looking at Step Functions to do it; you can make quite nice workflows that automate Athena and that cost very little to run. I have applications that do more or less exactly what you're trying to do using Step Functions, Lambda, and Athena, and they cost pennies (I use IAM roles for the cross-account work, though).
