AWS Athena MSCK REPAIR TABLE tablename命令 [英] AWS Athena MSCK REPAIR TABLE tablename command

查看:209
本文介绍了AWS Athena MSCK REPAIR TABLE tablename命令的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们期望该命令有多少个分区

Is there any number of partitions we would expect this command

MSCK REPAIR TABLE tablename;

失败了吗?

我有一个当前有超过27k分区的系统,并且为Athena表更改了架构,我们删除了该表,重新创建该表,并说到最后添加了新列,然后运行

I have a system that currently has over 27k partitions and the schema changes for the Athena table we drop the table, recreate the table with say the new column(s) tacked to the end and then run

MSCK REPAIR TABLE tablename;

该命令没有任何运气,因此每次运行5小时后, 。没有添加单个分区。想知道是否有人知道我们可能遇到的分区限制信息,但是找不到任何地方的文档。

We had no luck with this command doing any work what so every after we let it run for 5 hours. Not a single partition was added. Wondering if anyone has information about a partition limit we may have hit but can't find documented anywhere.

推荐答案

MSCK修复表是效率极低的命令。我真的希望文档不会鼓励人们使用它。

MSCK REPAIR TABLE is an extremely inefficient command. I really wish the documentation didn't encourage people to use it.

该怎么做取决于您所处情况的许多独特之处。

What to do instead depends on a number of things that are unique to your situation.

在一般情况下,我建议编写一个脚本,该脚本执行S3列表并构造包含其位置的分区列表,并使用Glue API BatchCreatePartition 来添加分区

In the general case I would recommend writing a script that performed S3 listings and constructed a list of partitions with their locations, and used the Glue API BatchCreatePartition to add the partitions to your table.

当您的S3位置包含很多文件时,就像听起来一样,我要么使用 S3库存以避免列出所有内容或列出分隔符为 / ,这样我就可以只列出存储桶的目录/分区结构部分,而跳过列出所有文件。如果您避免列出所有内容,则可以很快列出27K个分区。

When your S3 location contains lots of files, like it sounds yours does, I would either use S3 Inventory to avoid listing everything, or list objects with a delimiter of / so that I could list only the directory/partition structure part of the bucket and skip listing all files. 27K partitions can be listed fairly quickly if you avoid listing everything.

Glue的 BatchCreatePartitions 有点烦人,因为您必须为每个分区指定所有列,序列号和所有内容,但是它比运行 ALTER TABLE…ADD PARTION…并等待查询执行完成要快得多,这很可笑比 MSCK REPAIR TABLE…... 快。

Glue's BatchCreatePartitions is a bit annoying to use since you have to specify all columns, the serde, and everything for each partition, but it's faster than running ALTER TABLE … ADD PARTION … and waiting for query execution to finish – and ridiculously faster than MSCK REPAIR TABLE ….

在向现有表中添加新分区时,也永远不要使用 MSCK维修台的原因大致相同。几乎总是在将新分区添加到表时知道新分区的位置,以及 ALTER TABLE…ADD PARTION…或Glue的 BatchCreatePartitions 可以直接使用,无需编写脚本。

When it comes to adding new partitions to an existing table you should also never use MSCK REPAIR TABLE, for mostly the same reasons. Almost always when you add new partitions to a table you know the location of the new partitions, and ALTER TABLE … ADD PARTION … or Glue's BatchCreatePartitions can be used directly with no scripting necessary.

如果添加新数据的过程与添加新分区的过程是分开的,我建议设置将S3通知发送到SQS队列,并定期读取消息,聚集新文件的位置并从中构造新分区的列表。

If the process that adds new data is separate from the process that adds new partitions, I would recommend setting up S3 notifications to an SQS queue and periodically reading the messages, aggregating the locations of new files and constructing the list of new partitions from that.

这篇关于AWS Athena MSCK REPAIR TABLE tablename命令的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆