Large scale image storage


Problem description

I will likely be involved in a project where an important component is storage for a large number of files (in this case images, but it should just act as file storage).

The number of incoming files is expected to be around 500,000 per week (averaging around 100 KB each), peaking at around 100,000 files per day and 5 per second. The total number of files is expected to reach tens of millions before reaching an equilibrium where files are expired for various reasons at the input rate.

So I need a system that can store around 5 files per second at peak hours, while reading around 4 and deleting 4 at any time.

My initial idea is that a plain NTFS file system with a simple service for storing, expiring and reading files should actually be sufficient. I could imagine the service creating sub-folders for each year, month, day and hour to keep the number of files per folder at a minimum and to allow manual expiration in case that should be needed.
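For illustration, a minimal Python sketch of that date-based layout (the root folder and file name here are just placeholders, not from the question):

```python
from datetime import datetime
from pathlib import Path

def date_based_path(root, filename, when=None):
    """Build <root>/YYYY/MM/DD/HH/<filename> for the given timestamp."""
    when = when or datetime.now()
    return Path(root) / f"{when:%Y}" / f"{when:%m}" / f"{when:%d}" / f"{when:%H}" / filename

# Example: store an incoming image under the folder for the current hour.
target = date_based_path(r"D:\imagestore", "sensor_0042.jpg")
target.parent.mkdir(parents=True, exist_ok=True)
# target.write_bytes(image_bytes)  # write the incoming file's bytes here
```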

A large NTFS solution has been discussed here: http://stackoverflow.com/questions/197162/ntfs-performance-and-large-volumes-of-files-and-directories, but I could still use some advice on what problems to expect when building a storage system with the mentioned specifications, what maintenance problems to expect, and what alternatives exist. Preferably I would like to avoid distributed storage, if that is possible and practical.

EDIT

Thanks for all the comments and suggestions. Some more bonus info about the project:

This is not a web application where images are supplied by end users. Without disclosing too much, since this is in the contract phase, it's more in the category of quality control. Think of a production plant with conveyor belts and sensors. It's not traditional quality control, since the value of the product is entirely dependent on the image and metadata database working smoothly.

The images are accessed 99% of the time by an autonomous application in first-in, first-out order, but random access by a user application will also occur. Images older than a day will mainly serve archival purposes, though that purpose is also very important.

Expiration of the images follows complex rules for various reasons, but at some date all images should be deleted. Deletion rules follow business logic that depends on metadata and user interactions.

There will be downtime each day, during which maintenance can be performed.

Preferably the file storage will not have to communicate image locations back to the metadata server. The image location should be uniquely deducible from the metadata, possibly through a mapping database, if some kind of hashing or distributed system is chosen.

So my questions are:


  • Which technologies will do a robust job?

  • Which technologies will have the lowest implementation costs?

  • Which technologies will be easiest to maintain by the customer's IT department?

  • What risks are there for a given technology at this scale (5-20 TB of data, 10-100 million files)?

Recommended answer

Here are some random thoughts on implementation and possible issues, based on the following assumptions: an average image size of 100 KB, and a steady state of 50M (5 GB) images. This also assumes users will not be accessing the file store directly, but will do it through software or a web site:


  1. Storage medium: The image size you give amounts to rather paltry read and write speeds; I would think most common hard drives wouldn't have an issue with this throughput. I would put them in a RAID1 configuration for data security, however. Backups wouldn't appear to be too much of an issue, since it's only 5 GB of data.

  2. File storage: To prevent issues with the maximum number of files in a directory, I would take a hash of the file (MD5 at minimum; this would be the quickest, but also the most collision-prone. And before people chirp in to say MD5 is broken: this is for identification, not security. An attacker could pad images for a second-preimage attack and replace all the images with goatse, but we'll consider this unlikely), and convert that hash to a hexadecimal string. Then, when it comes time to stash the file in the file system, take the hex string in blocks of 2 characters and create a directory structure for the file based on that. E.g. if the file hashes to abcdef, the root directory would be ab, under that a directory called cd, and under that you would store the image with the name abcdef. The real name will be kept somewhere else (discussed below).
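A minimal Python sketch of that hashing-and-nesting scheme (the store root, the number of levels, and the helper name are illustrative assumptions, not from the answer):

```python
import hashlib
from pathlib import Path

def hashed_path(store_root, image_bytes, levels=2):
    """Return the nested path for a file: a hash starting 'abcdef...'
    with levels=2 maps to <store_root>/ab/cd/<full hash>."""
    digest = hashlib.md5(image_bytes).hexdigest()  # identification only, not security
    parts = [digest[i * 2:(i + 1) * 2] for i in range(levels)]
    return Path(store_root).joinpath(*parts) / digest

# Example: store an incoming image; identical files land on the same path
# (single-instance storage).
data = Path("incoming/sensor_0042.jpg").read_bytes()
target = hashed_path(r"D:\imagestore", data)
target.parent.mkdir(parents=True, exist_ok=True)
target.write_bytes(data)
```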

With this approach, if you start hitting file system limits (or performance issues) from too many files in a directory, you can just have the file storage component create another level of directories. You could also store in the metadata how many levels of directories the file was created with, so if you expand later, older files won't be looked for in the newer, deeper directories.

Another benefit here: if you hit transfer speed issues, or file system issues in general, you can easily split off a set of files onto other drives. Just change the software to keep the top-level directories on different drives. So if you want to split the store in half, put 00-7F on one drive and 80-FF on the other.
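One possible way to express that split, sketched in Python (the drive letters and ranges are assumptions for illustration):

```python
# Hypothetical prefix-to-drive table: 00-7F on one drive, 80-FF on another.
DRIVE_RANGES = [
    (r"D:\imagestore", 0x00, 0x7F),
    (r"E:\imagestore", 0x80, 0xFF),
]

def drive_root_for(digest_hex):
    """Pick the drive whose range contains the first byte of the hash."""
    first_byte = int(digest_hex[:2], 16)
    for root, low, high in DRIVE_RANGES:
        if low <= first_byte <= high:
            return root
    raise ValueError("no drive configured for prefix " + digest_hex[:2])

# drive_root_for("ab12...") -> r"E:\imagestore", since 0xAB falls in the 80-FF range
```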

Hashing also nets you single-instance storage, which can be nice. Since the hashes of a normal population of files tend to be random, this should also give you an even distribution of files across all directories.

  3. Metadata storage: While 50M rows seems like a lot, most DBMSs are built to scoff at that number of records, given enough RAM, of course. The following is written with SQL Server in mind, but I'm sure most of it will apply to others. Create a table with the hash of the file as the primary key, along with things like the size, format, and level of nesting. Then create another table with an artificial key (an int Identity column would be fine for this), the original name of the file (varchar(255) or whatever), the hash as a foreign key back to the first table, and the date it was added, with an index on the file name column. Also add any other columns you need to figure out whether a file is expired or not. This lets you keep the original names in case people try to put the same file in under different names (the files being otherwise identical, since they hash the same).
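The answer describes this schema in SQL Server terms; the sketch below uses SQLite purely as a stand-in so it runs anywhere, and the table and column names (StoredFile, FileReference, ExpiresOn, etc.) are assumptions, not the answer's:

```python
import sqlite3

conn = sqlite3.connect("imagestore_metadata.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS StoredFile (
    Hash       TEXT PRIMARY KEY,       -- hex digest of the file contents
    SizeBytes  INTEGER NOT NULL,
    Format     TEXT,
    NestLevels INTEGER NOT NULL        -- directory levels used when it was stored
);

CREATE TABLE IF NOT EXISTS FileReference (
    Id           INTEGER PRIMARY KEY AUTOINCREMENT,  -- artificial key
    OriginalName TEXT NOT NULL,                      -- name the file arrived with
    Hash         TEXT NOT NULL REFERENCES StoredFile(Hash),
    DateAdded    TEXT NOT NULL,
    ExpiresOn    TEXT                                -- plus whatever expiry columns apply
);

CREATE INDEX IF NOT EXISTS IX_FileReference_Name ON FileReference(OriginalName);
""")
conn.commit()
```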

  4. Maintenance: This should be a scheduled task. Let Windows worry about when your task runs; that's less for you to debug and get wrong (what if you do maintenance every night at 2:30 AM, and you're somewhere that observes summer/daylight saving time? 2:30 AM doesn't happen during the spring changeover). The service will then run a query against the database to establish which files have expired (based on the data stored per file name, so it knows when all references that point to a stored file have expired; any hashed file not referenced by at least one row in the file name table is no longer needed). The service then goes and deletes these files.
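A rough sketch of that maintenance pass, reusing the assumed schema above (the ExpiresOn column and the date format are assumptions):

```python
import os
import sqlite3

def purge_expired(conn, store_root):
    """Drop expired references, then delete stored files nobody references anymore."""
    conn.execute("DELETE FROM FileReference "
                 "WHERE ExpiresOn IS NOT NULL AND ExpiresOn <= date('now')")
    orphans = conn.execute("""
        SELECT Hash, NestLevels FROM StoredFile
        WHERE Hash NOT IN (SELECT Hash FROM FileReference)
    """).fetchall()
    for digest, levels in orphans:
        parts = [digest[i * 2:(i + 1) * 2] for i in range(levels)]
        path = os.path.join(store_root, *parts, digest)
        if os.path.exists(path):
            os.remove(path)
        conn.execute("DELETE FROM StoredFile WHERE Hash = ?", (digest,))
    conn.commit()

# Run nightly, e.g. from a Windows Task Scheduler job:
# purge_expired(sqlite3.connect("imagestore_metadata.db"), r"D:\imagestore")
```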

I think that's about it for the major parts.

EDIT: My comment was getting too long, so I'm moving it into an edit:

Whoops, my mistake, that's what I get for doing math when I'm tired. In this case, if you want to avoid the extra redundancy of adding RAID levels (e.g. 51 or 61, mirrored across a striped set), hashing would net you the benefit of being able to slot five 1 TB drives into the server and then have the file storage software span the drives by hash, as mentioned at the end of point 2. You could even RAID1 the drives for added security.

Backing up would be more complex, though the file system creation/modification times would still hold for doing this (you could have the service touch each file to update its modification time when a new reference to that file is added).

I see a two-fold downside to going by date/time for the directories. First, it is unlikely that the distribution would be uniform; this will cause some directories to be fuller than others. Hashing would distribute evenly. As for spanning, you could monitor the free space on the drive as you add files, and start spilling over to the next drive when space runs out. I imagine part of the expiry is date-related, so you would have older drives start to empty as newer ones fill up, and you'd have to figure out how to balance that.
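A tiny sketch of that spill-over idea (the drive list and free-space threshold are assumptions):

```python
import shutil

DRIVES = [r"D:\imagestore", r"E:\imagestore", r"F:\imagestore"]
MIN_FREE_BYTES = 50 * 1024 ** 3   # keep at least 50 GB headroom per drive

def pick_drive():
    """Return the first configured drive that still has room for new files."""
    for root in DRIVES:
        if shutil.disk_usage(root).free > MIN_FREE_BYTES:
            return root
    raise RuntimeError("all image-store drives are below the free-space threshold")
```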

The metadata store doesn't have to be on the storage server itself. You're already storing file-related data in the database. Instead of referencing the path directly from the row where it is used, reference the file name key (the second table I mentioned) instead.

I imagine users go through some sort of web or application front end to interface with the store, so the smarts for figuring out where a file goes on the storage server would live there; just share out the roots of the drives (or do some fancy stuff with NTFS junctioning to put all the drives into one subdirectory). If you expect to pull down a file via a web site, create a page on the site that takes the file name ID, looks up the hash in the DB, breaks the hash up to whatever level is configured, requests the file over the share from the server, and streams it back to the client. If you expect to access the file via a UNC path, have the server just build the UNC instead.

Both of these methods make your end-user application less dependent on the structure of the file system itself, and will make it easier for you to tweak and expand your storage later.
