Redshift Spectrum can read the manifest file (for external table) but not the actual files?

Problem description

I have an external table in Amazon Redshift:

create external table ruben.ruben_manifest_test
(
    customer_id bigint,
    external_cust_id varchar(30)
)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
with serdeproperties('serialization.format'='1')
stored as
inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 's3://mybucket/folder1/redshift_external_table_location.txt';

The location redshift_external_table_location.txt is a manifest file with the following contents:

{
  "entries": [
    { "url": "s3://mybucket/folder2/file1.parquet",
      "meta": { "content_length": 8059990 } }
  ]
}

When I do select count(*) from ruben.ruben_manifest_test I get:


Spectrum Scan Error. Forbidden: HTTP response error code: 403 Message: AccessDenied Access Denied
x-amz-request-id: HBM2Q5DVA4WD7YJE
x-amz-id-2: OX8Z9+4NaKFLuDoQ7dWfAgtOy6KAN2ld+y2ksTo3Ela5xtgjonOQLEWdiZ5YuFq+PFNkSDfzOPs= (s3://mybucket/folder2/file1.parquet)

So obviously Redshift is able to read the manifest file (and obtain the S3 URI for the parquet file), but it's not able to read the parquet file due to Access Denied.

If I change the location of the external table to point to the folder containing the parquet file with alter table ruben.ruben_manifest_test set location 's3://mybucket/folder2/', then the select works. I think that means Redshift is able to read the parquet files just fine, at least when the location is a folder; for some reason it does not work when the files are listed in a manifest file.
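For reference, this is a sketch of that workaround with the names used above (point the table at the folder instead of the manifest file):

-- Point the external table at the folder that holds the parquet files
-- instead of at the manifest file.
alter table ruben.ruben_manifest_test set location 's3://mybucket/folder2/';

-- With the folder as the location, the scan succeeds.
select count(*) from ruben.ruben_manifest_test;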

But I do not understand what credentials are being used to fetch the parquet file, because the IAM roles associated with the Redshift cluster are definitely able to read those files. As far as I understand from the output of select * from svv_external_schemas, the IAM role for the external schema ruben is arn:aws:iam::xxxxxxx:role/mySpectrumRole, with a policy that allows glue:* and lakeformation:GetDataAccess. In AWS Lake Formation, mySpectrumRole has access to the data location s3://mybucket/*.
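The check I did looks roughly like this (if I read that view correctly, the role ARN shows up in the esoptions column of svv_external_schemas):

-- Inspect the external schema to see which IAM role Spectrum uses for it.
-- As far as I can tell, the role ARN is part of the esoptions column.
select schemaname, databasename, esoptions
from svv_external_schemas
where schemaname = 'ruben';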

Any idea on what can be the cause or how to troubleshoot it? Specifically, is there any way to see what iam role is being used for the second S3 request (the one for the parquet file)?

It seems to me that Redshift asks Lake Formation (GetDataAccess) for credentials and gets credentials narrowed down to access just the table location (which is just a manifest file), instead of getting credentials valid for both the manifest and the data files.

Solution

The AWS documentation on service integrations with Lake Formation says:

Queries using manifests are not supported.

So you can't have a Redshift Spectrum external table whose location is a manifest file if the manifest or the data files are in a registered data lake location in AWS Lake Formation.
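In other words, once the bucket is registered with Lake Formation, the external table has to point directly at the folder holding the data files. A sketch using the names from the question:

-- Same table definition as in the question, but the location is the folder
-- that contains the parquet files rather than a manifest file.
create external table ruben.ruben_manifest_test
(
    customer_id bigint,
    external_cust_id varchar(30)
)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
with serdeproperties('serialization.format'='1')
stored as
inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 's3://mybucket/folder2/';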

It's a pity that

  • This seems to be documented only in the Lake Formation documentation and not in the Redshift documentation (I asked, via the feedback button in [Using Redshift Spectrum with AWS Lake Formation][2], that they add a mention of the fact that manifests are not supported with Lake Formation)
  • Redshift does not recognize that this is an unsupported case, so the error message does not mention it
