AWS Athena MSCK REPAIR TABLE 表名命令 [英] AWS Athena MSCK REPAIR TABLE tablename command

查看:25
本文介绍了AWS Athena MSCK REPAIR TABLE 表名命令的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这个命令有多少个我们期望的分区

Is there any number of partitions we would expect this command

MSCK REPAIR TABLE tablename;

失败?

我有一个系统,目前有超过 27k 个分区,我们删除了 Athena 表的架构更改,重新创建表,并说新列添加到末尾,然后运行

I have a system that currently has over 27k partitions and the schema changes for the Athena table we drop the table, recreate the table with say the new column(s) tacked to the end and then run

MSCK REPAIR TABLE tablename;

我们每次运行 5 小时后,都无法使用此命令执行任何工作.没有添加一个分区.想知道是否有人知道我们可能已达到但无法在任何地方找到记录的分区限制的信息.

We had no luck with this command doing any work what so every after we let it run for 5 hours. Not a single partition was added. Wondering if anyone has information about a partition limit we may have hit but can't find documented anywhere.

推荐答案

MSCK REPAIR TABLE 是一个效率极低的命令.我真的希望文档不要鼓励人们使用它.

MSCK REPAIR TABLE is an extremely inefficient command. I really wish the documentation didn't encourage people to use it.

要做什么取决于您的情况所特有的许多事情.

What to do instead depends on a number of things that are unique to your situation.

在一般情况下,我会建议编写一个脚本来执行 S3 列表并构建分区列表及其位置,并使用 Glue API BatchCreatePartition 将分区添加到您的表中.

In the general case I would recommend writing a script that performed S3 listings and constructed a list of partitions with their locations, and used the Glue API BatchCreatePartition to add the partitions to your table.

当您的 S3 位置包含大量文件时,就像您的那样,我会使用 S3 Inventory 避免列出所有内容,或者列出带有 / 分隔符的对象,这样我就可以只列出存储桶的目录/分区结构部分并跳过列出所有文件.如果您避免列出所有内容,则可以相当快地列出 27K 分区.

When your S3 location contains lots of files, like it sounds yours does, I would either use S3 Inventory to avoid listing everything, or list objects with a delimiter of / so that I could list only the directory/partition structure part of the bucket and skip listing all files. 27K partitions can be listed fairly quickly if you avoid listing everything.

Glue 的 BatchCreatePartitions 使用起来有点烦人,因为您必须为每个分区指定所有列、serde 和所有内容,但它比运行 ALTER TABLE ... ADD PARTTION ... 并等待查询执行完成 - 并且比 MSCK REPAIR TABLE ... 快得多.

Glue's BatchCreatePartitions is a bit annoying to use since you have to specify all columns, the serde, and everything for each partition, but it's faster than running ALTER TABLE … ADD PARTION … and waiting for query execution to finish – and ridiculously faster than MSCK REPAIR TABLE ….

在向现有表添加新分区时,您也不应该使用 MSCK REPAIR TABLE,原因大致相同.几乎总是在向表中添加新分区时,您知道新分区的位置,并且 ALTER TABLE ... ADD PARTTION ... 或 Glue 的 BatchCreatePartitions 可以直接使用,无需编写脚本必要的.

When it comes to adding new partitions to an existing table you should also never use MSCK REPAIR TABLE, for mostly the same reasons. Almost always when you add new partitions to a table you know the location of the new partitions, and ALTER TABLE … ADD PARTION … or Glue's BatchCreatePartitions can be used directly with no scripting necessary.

如果添加新数据的过程与添加新分区的过程是分开的,我建议将 S3 通知设置到 SQS 队列并定期读取消息、聚合新文件的位置并构建新分区列表

If the process that adds new data is separate from the process that adds new partitions, I would recommend setting up S3 notifications to an SQS queue and periodically reading the messages, aggregating the locations of new files and constructing the list of new partitions from that.

这篇关于AWS Athena MSCK REPAIR TABLE 表名命令的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆