Cleaning orphaned files out of GridFS

Problem description

I have a collection referencing GridFS files, generally 1-2 files per record. The collections are reasonably large - about 705k records in the parent collection, and 790k GridFS files. Over time, a number of orphaned GridFS files have accumulated - the parent records were deleted, but the referenced files weren't. I'm now attempting to clean the orphaned files out of the GridFS collection.

The problem with an approach like the one suggested here is that combining the 700k records into a single large list of ids results in a Python list that's about 4MB in memory - passing that into a $nin query in Mongo on the fs.files collection takes literally forever. Doing the reverse (getting a list of all ids in fs.files and querying the parent collection to see if they exist) also takes forever.

Has anybody come up against this and developed a faster solution?

Recommended answer

Firstly, let's take the time to consider what GridFS actually is. As a starter, let's read from the manual page that is referenced:

"GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of 16MB."

So, with that out of the way, that may well be your use case. But the lesson to learn here is that GridFS is not automatically the "go-to" method for storing files.

What has happened in your case (and others) is that, because this is a "driver level" specification (and MongoDB itself does no magic here), your "files" have been "split" across two collections: one collection for the main reference to the content, and the other for the "chunks" of data.

Your problem (and others') is that you have managed to leave the "chunks" behind now that the "main" reference has been removed. So, with a large number of them, how do you get rid of the orphans?

Your current reading says "loop and compare", and since MongoDB does not do joins, there really is no other answer. But there are some things that can help.

So rather than run one huge $nin, try doing a few different things to break this up. Consider working in the reverse order, for example:

db.fs.chunks.aggregate([
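    // distinct "files_id" values (the references back to fs.files), 5000 at a time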
    { "$group": { "_id": "$files_id" } },
    { "$limit": 5000 }
])

So what you are doing there is getting the distinct "files_id" values (being the references to fs.files) from all of the entries, for 5000 of your entries to start with. Then, of course, you're back to the looping, checking fs.files for a matching _id. If nothing is found, then remove the documents matching that "files_id" from your "chunks".
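
As a rough sketch of that loop in PyMongo (the question is working from Python), it might look like the following; the client, database name, and use of delete_many are assumptions rather than part of the answer:

from pymongo import MongoClient

db = MongoClient()["mydatabase"]  # hypothetical database name

# First batch of distinct files_id values referenced by fs.chunks
batch = db["fs.chunks"].aggregate([
    {"$group": {"_id": "$files_id"}},
    {"$limit": 5000},
])

for doc in batch:
    files_id = doc["_id"]
    # No fs.files document with this _id means these chunks are orphans
    if db["fs.files"].find_one({"_id": files_id}, {"_id": 1}) is None:
        db["fs.chunks"].delete_many({"files_id": files_id})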

But that was only 5000, so keep the last id found in that set, because now you are going to run the same aggregate statement again, but differently:

db.fs.chunks.aggregate([
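    // start from the last files_id seen in the previous batch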
    { "$match": { "files_id": { "$gte": last_id } } },
    { "$group": { "_id": "$files_id" } },
    { "$limit": 5000 }
])

So this works because the ObjectId values are monotonic, or "ever increasing": all new entries are always greater than the last. Then you can loop over those values again and do the same deletes where nothing is found.
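
Wiring the batches together, a sketch of the outer loop might look like this. It departs from the pipelines above in two small, labelled ways: a $sort stage is added so "the last id in the batch" is well defined, and $gt is used instead of $gte so the id already handled is not re-checked; the helper name is illustrative:

from pymongo import MongoClient

db = MongoClient()["mydatabase"]  # hypothetical database name

def clean_batch(docs):
    """Delete orphaned chunks for a batch of distinct files_id values."""
    last_id = None
    for doc in docs:
        last_id = doc["_id"]
        if db["fs.files"].find_one({"_id": last_id}, {"_id": 1}) is None:
            db["fs.chunks"].delete_many({"files_id": last_id})
    return last_id

last_id = None
while True:
    pipeline = []
    if last_id is not None:
        # $gt rather than $gte so the id already handled is not re-checked
        pipeline.append({"$match": {"files_id": {"$gt": last_id}}})
    pipeline += [
        {"$group": {"_id": "$files_id"}},
        {"$sort": {"_id": 1}},  # added so "the last id in the batch" is well defined
        {"$limit": 5000},
    ]
    docs = list(db["fs.chunks"].aggregate(pipeline))
    if not docs:
        break
    last_id = clean_batch(docs)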

Will this "take forever"? Well, yes. You might employ db.eval() for this, but read the documentation first. Overall, though, this is the price you pay for using two collections.

Back to the start. The GridFS spec is designed this way because it specifically wants to work around the 16MB limitation. But if that is not your limitation, then question why you are using GridFS in the first place.

MongoDB has no problem storing "binary" data within any element of a given BSON document. So you do not need to use GridFS just to store files. And if you had done so, then all of your updates would be completely "atomic", as they only act on one document in one collection at a time.
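
As a sketch of that alternative, a small file can be stored inline as a binary field with PyMongo, provided it stays under the 16MB document limit; the collection and field names here are illustrative:

from bson.binary import Binary
from pymongo import MongoClient

db = MongoClient()["mydatabase"]  # hypothetical database name

# Read a small file and store it inline; it must fit within the 16MB BSON limit
with open("avatar.png", "rb") as f:
    payload = Binary(f.read())

# The file data lives in the same document as its metadata, so deleting the
# parent document removes the file data atomically with it
db.attachments.insert_one({
    "filename": "avatar.png",
    "contentType": "image/png",
    "data": payload,
})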

Since GridFS deliberately splits documents across collections, if you use it, then you live with the pain. So use it if you need it; if you do not, then just store the BinData as a normal field, and these problems go away.

But at least you have a better approach to take than loading everything into memory.
