AWS Athena MSCK REPAIR TABLE tablename命令 [英] AWS Athena MSCK REPAIR TABLE tablename command
问题描述
我们期望该命令有多少个分区
Is there any number of partitions we would expect this command
MSCK REPAIR TABLE tablename;
失败了吗?
我有一个当前有超过27k分区的系统,并且为Athena表更改了架构,我们删除了该表,重新创建该表,并说到最后添加了新列,然后运行
I have a system that currently has over 27k partitions and the schema changes for the Athena table we drop the table, recreate the table with say the new column(s) tacked to the end and then run
MSCK REPAIR TABLE tablename;
该命令没有任何运气,因此每次运行5小时后, 。没有添加单个分区。想知道是否有人知道我们可能遇到的分区限制信息,但是找不到任何地方的文档。
We had no luck with this command doing any work what so every after we let it run for 5 hours. Not a single partition was added. Wondering if anyone has information about a partition limit we may have hit but can't find documented anywhere.
推荐答案
MSCK修复表
是效率极低的命令。我真的希望文档不会鼓励人们使用它。
MSCK REPAIR TABLE
is an extremely inefficient command. I really wish the documentation didn't encourage people to use it.
该怎么做取决于您所处情况的许多独特之处。
What to do instead depends on a number of things that are unique to your situation.
在一般情况下,我建议编写一个脚本,该脚本执行S3列表并构造包含其位置的分区列表,并使用Glue API BatchCreatePartition
来添加分区
In the general case I would recommend writing a script that performed S3 listings and constructed a list of partitions with their locations, and used the Glue API BatchCreatePartition
to add the partitions to your table.
当您的S3位置包含很多文件时,就像听起来一样,我要么使用 S3库存以避免列出所有内容或列出分隔符为 /
,这样我就可以只列出存储桶的目录/分区结构部分,而跳过列出所有文件。如果您避免列出所有内容,则可以很快列出27K个分区。
When your S3 location contains lots of files, like it sounds yours does, I would either use S3 Inventory to avoid listing everything, or list objects with a delimiter of /
so that I could list only the directory/partition structure part of the bucket and skip listing all files. 27K partitions can be listed fairly quickly if you avoid listing everything.
Glue的 BatchCreatePartitions
有点烦人,因为您必须为每个分区指定所有列,序列号和所有内容,但是它比运行 ALTER TABLE…ADD PARTION…
并等待查询执行完成要快得多,这很可笑比 MSCK REPAIR TABLE…...
快。
Glue's BatchCreatePartitions
is a bit annoying to use since you have to specify all columns, the serde, and everything for each partition, but it's faster than running ALTER TABLE … ADD PARTION …
and waiting for query execution to finish – and ridiculously faster than MSCK REPAIR TABLE …
.
在向现有表中添加新分区时,也永远不要使用 MSCK维修台
的原因大致相同。几乎总是在将新分区添加到表时知道新分区的位置,以及 ALTER TABLE…ADD PARTION…
或Glue的 BatchCreatePartitions
可以直接使用,无需编写脚本。
When it comes to adding new partitions to an existing table you should also never use MSCK REPAIR TABLE
, for mostly the same reasons. Almost always when you add new partitions to a table you know the location of the new partitions, and ALTER TABLE … ADD PARTION …
or Glue's BatchCreatePartitions
can be used directly with no scripting necessary.
如果添加新数据的过程与添加新分区的过程是分开的,我建议设置将S3通知发送到SQS队列,并定期读取消息,聚集新文件的位置并从中构造新分区的列表。
If the process that adds new data is separate from the process that adds new partitions, I would recommend setting up S3 notifications to an SQS queue and periodically reading the messages, aggregating the locations of new files and constructing the list of new partitions from that.
这篇关于AWS Athena MSCK REPAIR TABLE tablename命令的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!