Storage for millions of images


Question


I need to prepare storage for hundreds of millions of images (I currently have 70 million, and the number is still growing). Each image is approx. 20 kB. Of course I could store them in a filesystem, but I'm afraid of the number of inodes. I have tested MongoDB and Cassandra. Both have disadvantages (I have limited HDD resources):

  • MongoDB - disk space consumption is 3 times the size of the raw data
  • Cassandra - disk space consumption is similar to the size of the raw data, but Cassandra needs a lot of free space for its compaction procedure

Can anybody suggest a proper solution for this kind of problem?
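
For scale, the numbers above can be turned into a rough back-of-the-envelope estimate. This is only a sketch: it treats "approx. 20 kB" as 20,000 bytes, uses decimal terabytes, takes the 3x MongoDB factor as reported in the question, and the 300 million growth target is an assumed example value, not a figure from the question.

```python
# Back-of-the-envelope sizing. Inputs come from the question, except
# FUTURE_IMAGE_COUNT, which is an assumed example for "hundreds of millions".
IMAGE_COUNT = 70_000_000          # images stored today
FUTURE_IMAGE_COUNT = 300_000_000  # hypothetical growth target (assumption)
AVG_IMAGE_BYTES = 20_000          # "approx. 20 kB", taken as 20,000 bytes
MONGODB_OVERHEAD = 3              # reported: 3x the raw data size


def raw_size_tb(count: int, avg_bytes: int = AVG_IMAGE_BYTES) -> float:
    """Raw payload size in decimal terabytes (1 TB = 10**12 bytes)."""
    return count * avg_bytes / 10**12


raw_now = raw_size_tb(IMAGE_COUNT)
mongo_now = raw_now * MONGODB_OVERHEAD
raw_future = raw_size_tb(FUTURE_IMAGE_COUNT)

print(f"raw data today:        {raw_now:.1f} TB")
print(f"in MongoDB (3x) today: {mongo_now:.1f} TB")
print(f"raw data at 300M:      {raw_future:.1f} TB")
```

The raw payload is small by modern disk standards; it is the per-store overhead and the object count (one inode per file on a filesystem) that drive the choice.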

Solution

I have, in my life, done video distribution with both S3 (Rackspace Cloud Files included) and MongoDB.

Most people, without a second glance, would go for S3; however, I have found that both have their downsides. One of the big problems is that S3 is not a CDN. It is redundant storage within a specific region that is not replicated to other S3 regions, which means you will need something like CloudFront on top of S3 to push your images to a sort of cache if you get serious load on your site.

S3 also has other features which make it less CDN-ish and more of a storage warehouse. That being said, for infrequently accessed files S3 is blazingly fast.

This dual layer of course creates complexities such as maintenance. Not only that, but a CDN works on TTLs, and even though many CDNs nowadays have edge purge abilities, they are still not a 100% sure way of making sure your files are no longer accessible.

So, due to the set-up and the accesses (including possible accesses of files that should have been deleted), this could get quite costly quite quickly.
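
The stale-access problem with TTLs can be illustrated with a toy model. This is a sketch only: `TtlEdgeCache` is a hypothetical class made up for illustration, not the API of any real CDN, and time is passed in explicitly so the behaviour is easy to follow.

```python
class TtlEdgeCache:
    """Toy model of a single CDN edge node caching origin objects for a fixed TTL."""

    def __init__(self, origin: dict, ttl_seconds: float):
        self.origin = origin          # stands in for the origin store (e.g. a bucket)
        self.ttl = ttl_seconds
        self._cache = {}              # key -> (body, expires_at)

    def get(self, key: str, now: float):
        entry = self._cache.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]           # served from the edge; origin is not consulted
        self._cache.pop(key, None)    # entry missing or expired
        body = self.origin.get(key)
        if body is not None:
            self._cache[key] = (body, now + self.ttl)
        return body

    def purge(self, key: str) -> None:
        """Explicit edge purge; in this model it only affects this one node."""
        self._cache.pop(key, None)


origin = {"cat.jpg": b"image-bytes"}
edge = TtlEdgeCache(origin, ttl_seconds=3600)

edge.get("cat.jpg", now=0)                # first request fills the edge cache
del origin["cat.jpg"]                     # file is deleted at the origin

stale = edge.get("cat.jpg", now=1800)     # still within the TTL
gone = edge.get("cat.jpg", now=3600)      # TTL has expired, origin is re-checked

print(stale is not None, gone is None)
```

In this model the deleted file keeps being served until the TTL runs out unless `purge` is called on every edge node that holds a copy, which is the maintenance burden described above.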

This is where MongoDB could win. Depending on your scenario, MongoDB could actually be cheaper here, because you could use a whole bunch of micro instances on AWS to hold your information, adding spot instance reservation to these instances (dirt cheap), and all you need is a big disk on a single machine.

Hell, you could even use S3 to store the images and then MongoDB as a CloudFront replacement.

When you want to push images to different regions, you just make a few spot instances in the target region and get MongoDB to replicate its data across. You can do some cool stuff with the replication too, to make sure only the files frequently accessed from that region are placed in that region.

So I wouldn't throw MongoDB out (or even Cassandra), rather I would do a means test between the two.

Edit

As an added note about S3 pricing: if you store your files in RR (Reduced Redundancy), the price roughly halves, which makes S3 very cheap. However, you still have the problem that S3 is not a CDN.

Further Edit

Since I really only carried on from @cirrus' answer, I will actually re-evaluate your question, which is kinda answered above.

As an example, YouTube actually stores all of their images on single computers that are then distributed, so they can easily manage 200m thumbnails and...well...a lot of views each day straight from the file system. So I think your worry about the file system is overrated.
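
If you do go the file system route, one common way to keep individual directories small is to shard files into subdirectories derived from a hash of the image ID. The following is a minimal sketch under stated assumptions: `shard_path`, the two-level layout, the `.jpg` suffix, and the `/data/images` root are all made up for illustration and are not something described in this answer or used by YouTube as far as I know.

```python
import hashlib
from pathlib import Path


def shard_path(root: Path, image_id: str, levels: int = 2, width: int = 2) -> Path:
    """Map an image ID to root/ab/cd/<sha1>.jpg using the leading hex digits of its SHA-1."""
    digest = hashlib.sha1(image_id.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return root.joinpath(*parts, f"{digest}.jpg")


# Two levels of two hex characters give 256 * 256 leaf directories.
LEAF_DIRS = (16 ** 2) ** 2
FILES_PER_DIR = 200_000_000 / LEAF_DIRS   # using the 200m figure mentioned above

example = shard_path(Path("/data/images"), "img-000001")
print(example)
print(f"{LEAF_DIRS} leaf directories, about {FILES_PER_DIR:.0f} files each")
```

Note that this only keeps each directory to a few thousand entries. It does not reduce the number of inodes needed, since every file still uses one, so on filesystems where the inode count is fixed at creation time it still has to be sized for the expected number of files.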

As for which database is better...I dunno, that comes down to your testing.

I mean, the answer to your problem depends upon your scenario, your budget, your hardware and your resources, i.e. if you have AWS servers this would be a whole different answer than with dedicated in-house servers.
