Optimal storage of data structure for fast lookup and persistence


Question

Scenario

I have the following methods:

public void AddItemSecurity(int itemId, int[] userIds)
public int[] GetValidItemIds(int userId)

Initially I'm thinking of storage of the form:

itemId -> userId, userId, userId

userId -> itemId, itemId, itemId

AddItemSecurity is based on how I get data from a third-party API; GetValidItemIds is how I want to use it at runtime.

There are potentially 2,000 users and 10 million items. Item ids are of the form 2007123456, 2010001234 (10 digits, where the first four represent the year).

AddItemSecurity does not have to perform super fast, but GetValidItemIds needs to be sub-second. Also, if there is an update on an existing itemId, I need to remove that itemId for users no longer in the list.
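
As a baseline, a naive in-memory version of the two methods might look like the sketch below (the class name and the dictionary-based layout are illustrative assumptions, not the eventual solution). It keeps both mappings and handles clearing an itemId from users that dropped out of the list:

```csharp
using System.Collections.Generic;
using System.Linq;

// Naive in-memory sketch of the two mappings; illustration only.
class ItemSecurityStore
{
    readonly Dictionary<int, HashSet<int>> _usersByItem = new Dictionary<int, HashSet<int>>();
    readonly Dictionary<int, HashSet<int>> _itemsByUser = new Dictionary<int, HashSet<int>>();

    public void AddItemSecurity(int itemId, int[] userIds)
    {
        HashSet<int> oldUsers;
        // Clear the itemId from users that are no longer in the list.
        if (_usersByItem.TryGetValue(itemId, out oldUsers))
            foreach (int u in oldUsers.Except(userIds))
                _itemsByUser[u].Remove(itemId);

        _usersByItem[itemId] = new HashSet<int>(userIds);
        foreach (int u in userIds)
        {
            HashSet<int> items;
            if (!_itemsByUser.TryGetValue(u, out items))
                _itemsByUser[u] = items = new HashSet<int>();
            items.Add(itemId);
        }
    }

    public int[] GetValidItemIds(int userId)
    {
        HashSet<int> items;
        return _itemsByUser.TryGetValue(userId, out items)
            ? items.ToArray()
            : new int[0];
    }
}
```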

I'm trying to think about how I should store this in an optimal fashion. Preferably on disk (with caching), but I want the code maintainable and clean.

If the item ids had started at 0, I thought about creating a byte array of length MaxItemId / 8 for each user, and setting a true/false bit for whether the item is present or not. That would limit the array length to a little over 1 MB per user and give fast lookups, as well as an easy way to update the list per user. By persisting this as memory-mapped files (the MemoryMappedFile support in the .NET 4 framework) I think I would get decent caching as well (if the machine has enough RAM) without implementing caching logic myself. Parsing the id, stripping out the year, and storing one array per year could be a solution.
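
A minimal sketch of that bitmap idea, assuming the ids have already had the year stripped off (the class name, file layout, and MaxItemId value are made up for illustration):

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Sketch: one bitmap per user, backed by a .NET 4 MemoryMappedFile.
// MaxItemId assumes the year has already been stripped off the id;
// the class name and file layout are made up for illustration.
class UserItemBitmap : IDisposable
{
    const long MaxItemId = 11000000;   // ~1.4 MB of bits per user

    readonly MemoryMappedFile _mmf;
    readonly MemoryMappedViewAccessor _view;

    public UserItemBitmap(string path)
    {
        _mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.OpenOrCreate, null, MaxItemId / 8 + 1);
        _view = _mmf.CreateViewAccessor();
    }

    public void Set(int itemId, bool present)
    {
        long byteOffset = itemId / 8;
        byte mask = (byte)(1 << (itemId % 8));
        byte b = _view.ReadByte(byteOffset);
        _view.Write(byteOffset, present ? (byte)(b | mask) : (byte)(b & ~mask));
    }

    public bool IsSet(int itemId)
    {
        return (_view.ReadByte(itemId / 8) & (1 << (itemId % 8))) != 0;
    }

    public void Dispose()
    {
        _view.Dispose();
        _mmf.Dispose();
    }
}
```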

条目标识 - >用户ID []列表中可以直接序列化到磁盘,并阅读,以持续名单/写与正常的FileStream **和diff它时,有修改。

The ItemId -> UserId[] list can be serialized directly to disk and read/write with a normal FileStream in order to persist the list and diff it when there are changes.
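
A hedged sketch of that persistence step (one file per item holding raw int32s is an assumed layout, not something the post specifies):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Sketch: persist itemId -> userId[] with a plain FileStream and compute
// the diff on update. The one-file-per-item layout is an assumption.
static class ItemUserListFile
{
    public static int[] Read(string path)
    {
        if (!File.Exists(path)) return new int[0];
        using (var reader = new BinaryReader(File.OpenRead(path)))
        {
            var ids = new List<int>();
            while (reader.BaseStream.Position < reader.BaseStream.Length)
                ids.Add(reader.ReadInt32());
            return ids.ToArray();
        }
    }

    public static void Write(string path, int[] userIds, out int[] removedUsers)
    {
        // Users present in the old list but not in the new one must have
        // this itemId cleared from their bitmaps.
        removedUsers = Read(path).Except(userIds).ToArray();

        using (var writer = new BinaryWriter(File.Create(path)))
            foreach (int id in userIds)
                writer.Write(id);
    }
}
```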

Each time a new user is added, all the lists have to be updated as well, but this can be done nightly.

Question

Should I continue to try out this approach, or are there other paths that should be explored as well? I'm thinking SQL Server will not perform fast enough, and it would add overhead (at least if it's hosted on a different server), but my assumptions might be wrong. Any thoughts or insights on the matter are appreciated. And I want to try to solve it without adding too much hardware :)

[Update 2010-03-31]

I have now tested with SQL Server 2008 under the following conditions.

  • A table with two columns (UserId, ItemId), both int
  • A clustered index on the two columns
  • Added ~800,000 items for 180 users, a total of 144 million rows
  • 4 GB RAM allocated for SQL Server
  • Dual-core 2.66 GHz laptop
  • SSD disk
  • Using a SqlDataReader to read all itemids into a List (a sketch of this loop follows the list)
  • Looping through all users
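
The per-user query loop would have looked roughly like this (a reconstruction; the table and column names are assumptions):

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;

// Rough reconstruction of the benchmark's inner loop; the table and
// column names are assumed, not taken from the post.
static List<int> GetItemIdsForUser(SqlConnection conn, int userId)
{
    var ids = new List<int>();
    using (var cmd = new SqlCommand(
        "SELECT ItemId FROM ItemSecurity WHERE UserId = @userId", conn))
    {
        cmd.Parameters.AddWithValue("@userId", userId);
        using (var reader = cmd.ExecuteReader())
            while (reader.Read())
                ids.Add(reader.GetInt32(0));
    }
    return ids;
}
```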

If I run one thread, it averages 0.2 seconds. When I add a second thread it goes up to 0.4 seconds, which is still OK. From there on the results degrade: a third thread brings a lot of the queries up to 2 seconds, a fourth thread up to 4 seconds, and a fifth spikes some of the queries up to 50 seconds.

The CPU is pegged while this is going on, even on one thread. My test app takes some of it due to the tight loop, and SQL Server takes the rest.

Which leads me to the conclusion that it won't scale very well, at least not on my test hardware. Are there ways to optimize the database, say by storing an array of ints per user instead of one record per item? But that makes it harder to remove items.

[Update 2010-03-31 #2]

I did a quick test with the same data, putting it as bits in memory-mapped files. It performs much better: six threads yield access times between 0.02 s and 0.06 s, purely memory bound. The mapped files were mapped by one process and accessed by six others simultaneously. And where the SQL database took 4 GB, the files on disk took 23 MB.
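
For reference, sharing a mapping between processes can be done with a named map; a minimal sketch (the path, map name, and size are made up):

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class SharedBitmapDemo
{
    static void Main()
    {
        // Owning process: create a named mapping over the data file.
        // The path, map name, and size are made up for this sketch.
        using (var mmf = MemoryMappedFile.CreateFromFile(
            @"C:\data\user42.bitmap", FileMode.OpenOrCreate,
            "user42-bitmap", 11000000 / 8 + 1))
        using (var view = mmf.CreateViewAccessor())
        {
            // Another process on the same machine would attach with:
            //   var shared = MemoryMappedFile.OpenExisting("user42-bitmap");
            int itemId = 12345;
            bool isSet = (view.ReadByte(itemId / 8) & (1 << (itemId % 8))) != 0;
            Console.WriteLine(isSet);
        }
    }
}
```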

Answer

After much testing I ended up using memory-mapped files, marking them with the sparse bit (NTFS), using code from NTFS Sparse Files with C#.

Wikipedia has an explanation of what a sparse file is.

The benefit of using a sparse file is that I don't have to care about what range my ids are in. If I only write ids between 2006000000 and 2010999999, the file will only allocate 625,000 bytes from offset 250,750,000 in the file. All space up to that offset is unallocated in the file system. Each id is stored as a set bit in the file, which is in effect treated as a bit array. And if the id sequence suddenly changes, the file will allocate in another part.
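
The arithmetic behind those numbers, as a worked example:

```csharp
// Worked example of the sparse-file layout arithmetic above.
static void SparseLayoutMath()
{
    long firstId = 2006000000;
    long lastId  = 2010999999;

    long startOffset = firstId / 8;          // 250,750,000: first allocated byte
    long idCount     = lastId - firstId + 1; // 5,000,000 ids in the range
    long bytesUsed   = idCount / 8;          // 625,000 bytes actually allocated

    // Locating the bit for a single id:
    long id   = 2007123456;
    long pos  = id / 8;                      // byte position in the file
    int  mask = 1 << (int)(id % 8);          // bit within that byte
}
```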

In order to retrieve which ids are set, I can perform an OS call to get the allocated parts of the sparse file, and then check each bit in those sequences. Also, checking whether a particular id is set is very fast: if it falls outside the allocated blocks, it's not there; if it falls within, it's merely one byte read and a bit-mask check to see if the correct bit is set.
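
The OS call in question is the NTFS FSCTL_QUERY_ALLOCATED_RANGES ioctl; a rough P/Invoke sketch follows (error handling and the retry needed when the output buffer is too small are omitted):

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

// Sketch: ask NTFS which parts of a sparse file are actually allocated,
// so only those regions need their bits scanned. Error handling and the
// MORE_DATA retry loop are omitted for brevity.
static class SparseRanges
{
    const uint FSCTL_QUERY_ALLOCATED_RANGES = 0x000940CF;

    [StructLayout(LayoutKind.Sequential)]
    public struct FILE_ALLOCATED_RANGE_BUFFER
    {
        public long FileOffset;   // start of an allocated region
        public long Length;       // length of the region in bytes
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool DeviceIoControl(
        SafeFileHandle hDevice, uint dwIoControlCode,
        ref FILE_ALLOCATED_RANGE_BUFFER lpInBuffer, int nInBufferSize,
        [Out] FILE_ALLOCATED_RANGE_BUFFER[] lpOutBuffer, int nOutBufferSize,
        out int lpBytesReturned, IntPtr lpOverlapped);

    public static FILE_ALLOCATED_RANGE_BUFFER[] Query(FileStream file)
    {
        var input = new FILE_ALLOCATED_RANGE_BUFFER
        {
            FileOffset = 0,
            Length = file.Length
        };
        var output = new FILE_ALLOCATED_RANGE_BUFFER[64];
        int structSize = Marshal.SizeOf(typeof(FILE_ALLOCATED_RANGE_BUFFER));
        int bytesReturned;

        DeviceIoControl(file.SafeFileHandle, FSCTL_QUERY_ALLOCATED_RANGES,
            ref input, structSize, output, output.Length * structSize,
            out bytesReturned, IntPtr.Zero);

        var ranges = new FILE_ALLOCATED_RANGE_BUFFER[bytesReturned / structSize];
        Array.Copy(output, ranges, ranges.Length);
        return ranges;   // only these regions need their bits scanned
    }
}
```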

So for the particular scenario where you have many ids which you want to check with as much speed as possible, this is the most optimal way I've found so far.

And the good part is that the memory-mapped files can be shared with Java as well (which turned out to be something we needed). Java also has support for memory-mapped files on Windows, and implementing the read/write logic is fairly trivial.
