Fastest way to delete 100M+ documents by ID


Problem Description


I'm currently faced with removing 100M+ documents from several collections ranging from 100k documents to 100M documents in a database with ~300M documents in total. Additionally, each document has references in other collections which must be nullified. I have a list of collection + ID of all the documents to be removed, and the goal is to remove them as quickly as possible, so as to have minimal impact on users.


My current approach is to send groups of 5k to be deleted via { _id: { $in: [] } } and in parallel send updates to all of the collections referencing those documents in the same grouped fashion. This turned out to be very slow, so I'm looking for alternatives.
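For illustration, a minimal PyMongo sketch of this grouped approach; the collection names (docs, refs), the reference field doc_id, and the connection string are hypothetical placeholders, not details from the question:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]  # placeholder database name

def chunks(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def delete_grouped(ids_to_remove, group_size=5000):
    for group in chunks(ids_to_remove, group_size):
        # One $in delete per group of 5k ids.
        db.docs.delete_many({"_id": {"$in": group}})
        # Nullify references to the removed documents in a referencing collection
        # (repeated for each collection that holds such references).
        db.refs.update_many({"doc_id": {"$in": group}}, {"$set": {"doc_id": None}})
```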


I just read about the Bulk Write API and I'm wondering if that might be a better solution. If so, I'm curious what the most efficient way to make use of it is. Should I keep grouping as I am now, but send several groups at once in one Bulk request? Should I stop grouping in the query and instead use a Bulk request as my group with 5k individual delete/update commands?

Recommended Answer


Because we can't afford user downtime and the solution is to be run on a daily basis (albeit at a much smaller scale, as we're catching up with this first run), I couldn't use Salvador Dali's solution. I ended up grouping my records-to-be-deleted into groups of 1k and sending a BulkWrite command containing one delete() operation for each record. In parallel I sent n BulkWrite commands to nullify references to each record, where n is the number of collections that reference the records and where each BulkWrite request has 1k individual update() operations, similar to the delete(). This performed reasonably quickly, so I didn't attempt to further optimize by adjusting the number of operations in the BulkWrite commands.
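A rough PyMongo sketch of this answer's approach, under the same assumptions as above (the collection names, the doc_id reference field, and the connection string are placeholders). Each group of 1k IDs becomes one bulk_write of individual deletes on the main collection, plus one bulk_write of updates per referencing collection; the answer issues the reference updates in parallel, which is left sequential here for brevity:

```python
from pymongo import MongoClient, DeleteOne, UpdateMany

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]  # placeholder database name

# Hypothetical: the n collections that hold references to the deleted documents.
referencing_collections = [db.refs_a, db.refs_b]

def chunks(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def bulk_delete(ids_to_remove, group_size=1000):
    for group in chunks(ids_to_remove, group_size):
        # One unordered bulk request containing an individual delete per record.
        db.docs.bulk_write([DeleteOne({"_id": _id}) for _id in group], ordered=False)
        # One bulk request per referencing collection, nullifying every reference
        # to each deleted record (UpdateMany catches multiple referrers per id).
        for coll in referencing_collections:
            coll.bulk_write(
                [UpdateMany({"doc_id": _id}, {"$set": {"doc_id": None}}) for _id in group],
                ordered=False,
            )
```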

