MongoDB as file storage


Question

I'm trying to find the best solution to create scalable storage for big files. File size can vary from 1-2 megabytes up to 500-600 gigabytes.

I have found some information about Hadoop and its HDFS, but it looks a little bit complicated, because I don't need any Map/Reduce jobs and many other features. Now I'm thinking of using MongoDB and its GridFS as a file storage solution.

Now the questions:

  1. What will happen with GridFS when I try to write a few files concurrently? Will there be any lock for read/write operations? (I will use it only as file storage.)
  2. Will files from GridFS be cached in RAM, and how will that affect read/write performance?
  3. Maybe there are some other solutions that can solve my problem more efficiently?

Thanks.

Answer

I can only answer for MongoDB here; I will not pretend I know much about HDFS and other such technologies.

The GridFS implementation is totally client-side, within the driver itself. This means there is no special loading or understanding of the context of file serving within MongoDB itself; effectively, MongoDB does not even understand that they are files ( http://docs.mongodb.org/manual/applications/gridfs/ ).
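
To make the client-side nature concrete, here is a minimal sketch using PyMongo's gridfs module (the connection string, database name, and filename are placeholders): the driver alone splits and reassembles the file as ordinary documents.

```python
# Minimal GridFS round-trip with PyMongo. Everything happens in the
# driver: put() writes plain documents into fs.files and fs.chunks,
# and get() reads them back; mongod sees only ordinary queries.
from pymongo import MongoClient
import gridfs

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["storage"]                             # placeholder database
fs = gridfs.GridFS(db)

# Store a file: the driver splits the bytes into chunk documents.
with open("video.bin", "rb") as f:
    file_id = fs.put(f, filename="video.bin")

# Read it back: the driver reassembles the chunks in order.
data = fs.get(file_id).read()
```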

This means that querying for any part of the files or chunks collection will result in the same process as any other query, whereby MongoDB loads the data it needs into your working set ( http://en.wikipedia.org/wiki/Working_set ), which represents the set of data (or all loaded data at that time) required by MongoDB within a given time frame to maintain optimal performance. It does this by paging it into RAM (well, technically the OS does).

Another point to take into consideration is that this is driver-implemented. This means the specification can vary; however, I don't think it does. All drivers will allow you to query for a set of documents from the files collection, which only houses the file metadata, allowing you to later serve the file itself from the chunks collection with a single query.
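
As a sketch of what that looks like at the collection level (collection and field names follow the GridFS spec; the filename is a placeholder), you fetch the metadata document first and only then pull the chunks:

```python
# Metadata lives in fs.files; the bytes live in fs.chunks, keyed by
# files_id and ordered by the chunk index n (per the GridFS spec).
meta = db.fs.files.find_one({"filename": "video.bin"})
print(meta["length"], meta["chunkSize"], meta["uploadDate"])

# One query serves the file itself from the chunks collection.
chunks = db.fs.chunks.find({"files_id": meta["_id"]}).sort("n", 1)
data = b"".join(chunk["data"] for chunk in chunks)
```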

However, that is not the important thing: you want to serve the file itself, including its data, which means that you will be loading the files collection and its subsequent chunks collection into your working set.

With that in mind, we have already hit the first snag:

Will files from GridFS be cached in RAM, and how will that affect read/write performance?

The read performance of small files could be awesome, directly from RAM; the writes would be just as good.

For larger files, not so. Most computers will not have 600 GB of RAM, and it is likely, quite normal in fact, to house a 600 GB partition of a single file on a single mongod instance. This creates a problem, since that file, in order to be served, needs to fit into your working set, yet it is impossibly bigger than your RAM; at this point you could have page thrashing ( http://en.wikipedia.org/wiki/Thrashing_%28computer_science%29 ), whereby the server is just page faulting 24/7 trying to load the file. The writes here are no better.
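
One client-side mitigation, as a sketch: PyMongo's GridOut object is file-like, so you can stream the file in pieces rather than buffering it all at once. Note this only keeps the client's memory flat; mongod still has to page every chunk through its own RAM to serve it.

```python
# Stream a GridFS file in fixed-size pieces instead of read()-ing it
# whole; the 8 MB piece size is an arbitrary example.
grid_out = fs.get(file_id)
with open("video_copy.bin", "wb") as out:
    while True:
        piece = grid_out.read(8 * 1024 * 1024)
        if not piece:
            break
        out.write(piece)
```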

The only way around this is to start putting a single file across many shards :\

Note: one more thing to consider is that the default average size of a GridFS chunk is 256 KB, so that's a lot of documents for a 600 GB file. This setting is adjustable in most drivers.
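
In PyMongo, for example, the chunk size can be overridden per file at write time (the 4 MB figure is an arbitrary example; newer drivers actually default to 255 KB rather than 256 KB):

```python
# Larger chunks mean fewer documents per file, at the cost of bigger
# individual reads and writes.
with open("video.bin", "rb") as f:
    file_id = fs.put(f, filename="video.bin", chunk_size=4 * 1024 * 1024)
```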

What will happen with GridFS when I try to write a few files concurrently? Will there be any lock for read/write operations? (I will use it only as file storage.)

GridFS, being only a specification, uses the same locks as any other collection: both read and write locks at the database level (2.2+) or at the global level (pre-2.2). The two do interfere with each other as well; i.e., how can you ensure a consistent read of a document that is being written to?

That being said, the possibility for contention exists depending on your scenario specifics: traffic, number of concurrent writes/reads, and many other things we have no idea about.

Maybe there are some other solutions that can solve my problem more efficiently?

I personally have found that S3 (as @mluggy said) with reduced redundancy storage works best, storing a mere portion of metadata about the file within MongoDB, much like using GridFS but without the chunks collection; let S3 handle all that distribution, backup, and other stuff for you.
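
A hedged sketch of that pattern (the bucket name, key, and size are placeholders; it assumes boto3 with AWS credentials configured): the bytes go to S3 under the reduced-redundancy storage class, and only a small metadata document goes into MongoDB.

```python
import boto3

# Upload the bytes to S3 using the reduced redundancy storage class
# mentioned above; bucket and key are placeholders.
s3 = boto3.client("s3")
s3.upload_file(
    "video.bin", "my-bucket", "files/video.bin",
    ExtraArgs={"StorageClass": "REDUCED_REDUNDANCY"},
)

# Keep only GridFS-style metadata in MongoDB, with no chunks collection.
db.files.insert_one({
    "filename": "video.bin",
    "s3_bucket": "my-bucket",
    "s3_key": "files/video.bin",
    "length": 1234567,  # placeholder file size in bytes
})
```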

Hopefully I have been clear; hope it helps.

Unlike what I accidentally said, MongoDB does not have a collection-level lock; it is a database-level lock.
