Optimal storage of data structure for fast lookup and persistence

Scenario

I have the following methods:

    public void AddItemSecurity(int itemId, int[] userIds)
    public int[] GetValidItemIds(int userId)

Initially I'm thinking of storing it in the form:

itemId -> userId, userId, userId

and

userId -> itemId, itemId, itemId

AddItemSecurity is based on how I get data from a third-party API; GetValidItemIds is how I want to use it at runtime.

There are potentially 2000 users and 10 million items. Item ids are of the form 2007123456, 2010001234 (10 digits, where the first four represent the year).

AddItemSecurity does not have to perform super fast, but GetValidItemIds needs to be subsecond. Also, if there is an update on an existing itemId, I need to remove that itemId for users who are no longer in the list.

I'm trying to think about how I should store this in an optimal fashion. Preferably on disk (with caching), but I want the code to be maintainable and clean.

If the item ids had started at 0, I thought about creating a byte array of length MaxItemId / 8 for each user, and setting a true/false bit to indicate whether the item is present. That would limit the array length to a little over 1 MB per user and give fast lookups, as well as an easy way to update the list per user. By persisting this as Memory Mapped Files with the .NET 4 framework, I think I would get decent caching as well (if the machine has enough RAM) without implementing caching logic myself. Parsing the id, stripping out the year, and storing an array per year could be a solution.
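
As a rough illustration of that idea, a per-user, per-year bitmap could be addressed like this. This is a minimal sketch; the file naming, the helper names, and the id split are my own assumptions for illustration, not the original code:

    using System;
    using System.IO;
    using System.IO.MemoryMappedFiles;

    static class UserItemBitmap
    {
        // Ids look like yyyyNNNNNN: four year digits, six serial digits.
        private const int ItemsPerYear = 1000000;
        private const long BytesPerYear = ItemsPerYear / 8; // 125,000 bytes

        public static void SetBit(string dir, int userId, int itemId, bool present)
        {
            int year = itemId / 1000000;    // strip out the year
            int serial = itemId % 1000000;  // index within that year

            string path = Path.Combine(dir,
                string.Format("user{0}_{1}.bin", userId, year));

            using (var mmf = MemoryMappedFile.CreateFromFile(
                path, FileMode.OpenOrCreate, null, BytesPerYear))
            using (var view = mmf.CreateViewAccessor())
            {
                long offset = serial / 8;
                byte mask = (byte)(1 << (serial % 8));
                byte current = view.ReadByte(offset);
                view.Write(offset, present
                    ? (byte)(current | mask)
                    : (byte)(current & ~mask));
            }
        }
    }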

The ItemId -> UserId[] list can be serialized directly to disk and read/written with a normal FileStream in order to persist the list and diff it when there are changes.
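
For the ItemId -> UserId[] side, the on-disk record layout could be as simple as this sketch (the [itemId][count][userIds...] format is an assumption for illustration):

    using System.IO;

    static class ItemSecurityFile
    {
        // One record per item: [itemId][count][userId, userId, ...]
        public static void WriteEntry(BinaryWriter writer, int itemId, int[] userIds)
        {
            writer.Write(itemId);
            writer.Write(userIds.Length);
            foreach (int userId in userIds)
                writer.Write(userId);
        }
    }

    // Usage over a normal FileStream:
    // using (var fs = new FileStream("items.bin", FileMode.Create))
    // using (var writer = new BinaryWriter(fs))
    //     ItemSecurityFile.WriteEntry(writer, 2007123456, new[] { 1, 2, 3 });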

Each time a new user is added, all the lists have to be updated as well, but this can be done nightly.

Question

Should I continue to try out this approach, or are there other paths which should be explored as well? I'm thinking SQL Server will not perform fast enough, and it would add overhead (at least if it's hosted on a different server), but my assumptions might be wrong. Any thoughts or insights on the matter are appreciated. And I want to try to solve it without adding too much hardware :)

[Update 2010-03-31]

I have now tested with SQL Server 2008 under the following conditions.

  • Table with two columns (userid, itemid), both int
  • Clustered index on the two columns
  • Added ~800,000 items each for 180 users - a total of 144 million rows
  • Allocated 4 GB of RAM for SQL Server
  • Dual-core 2.66 GHz laptop
  • SSD disk
  • Use a SqlDataReader to read all itemids into a List (see the sketch below)
  • Loop over all users
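
The per-user query in the test amounts to something like the following sketch (the UserItem table name and the connection string are assumptions; the shape of the reader loop follows the setup above):

    using System.Collections.Generic;
    using System.Data.SqlClient;

    static List<int> GetValidItemIds(string connectionString, int userId)
    {
        var ids = new List<int>();
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "SELECT itemid FROM UserItem WHERE userid = @userid", conn))
        {
            cmd.Parameters.AddWithValue("@userid", userId);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    ids.Add(reader.GetInt32(0));
            }
        }
        return ids;
    }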

If I run one thread, it averages 0.2 seconds. When I add a second thread it goes up to 0.4 seconds, which is still OK. From there on the results degrade. Adding a third thread brings a lot of the queries up to 2 seconds. A fourth thread brings them up to 4 seconds, and a fifth spikes some of the queries up to 50 seconds.

The CPU is pegged while this is going on, even on one thread. My test app takes some of it due to the tight loop, and SQL Server takes the rest.

Which leads me to the conclusion that it won't scale very well, at least not on the hardware I tested. There may be ways to optimize the database, say by storing an array of ints per user instead of one record per item, but that makes it harder to remove items.

[Update 2010-03-31 #2]

I did a quick test with the same data, putting it as bits in memory mapped files. It performs much better. Six threads yield access times between 0.02s and 0.06s, purely memory bound. The mapped files were mapped by one process and accessed by six others simultaneously. And whereas the SQL database took 4 GB, the files on disk took 23 MB.
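
Sharing one mapping between several processes can be done with a named map. A minimal sketch, assuming an arbitrary map name "UserBits":

    using System.IO;
    using System.IO.MemoryMappedFiles;

    static class SharedMap
    {
        // Mapping process: map the file under a well-known name.
        public static MemoryMappedFile Create(string path)
        {
            return MemoryMappedFile.CreateFromFile(
                path, FileMode.Open, "UserBits", 0,
                MemoryMappedFileAccess.Read);
        }

        // Reader processes: attach to the existing mapping read-only.
        public static MemoryMappedFile Attach()
        {
            return MemoryMappedFile.OpenExisting(
                "UserBits", MemoryMappedFileRights.Read);
        }
    }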

Solution

After much testing I ended up using Memory Mapped Files, marking them with the sparse bit (NTFS), using code from NTFS Sparse Files with C#.
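
The core of that technique is a single DeviceIoControl call on an open file handle. A minimal sketch of marking a file as sparse (the FSCTL_SET_SPARSE value 0x000900C4 comes from winioctl.h; the rest is illustrative, not the linked article's exact code):

    using System;
    using System.IO;
    using System.Runtime.InteropServices;
    using Microsoft.Win32.SafeHandles;

    static class SparseFile
    {
        private const int FSCTL_SET_SPARSE = 0x000900C4;

        [DllImport("kernel32.dll", SetLastError = true)]
        private static extern bool DeviceIoControl(
            SafeFileHandle hDevice, int dwIoControlCode,
            IntPtr inBuffer, int nInBufferSize,
            IntPtr outBuffer, int nOutBufferSize,
            out int bytesReturned, IntPtr overlapped);

        // The stream must be opened with write access.
        public static void MarkSparse(FileStream fs)
        {
            int returned;
            if (!DeviceIoControl(fs.SafeFileHandle, FSCTL_SET_SPARSE,
                                 IntPtr.Zero, 0, IntPtr.Zero, 0,
                                 out returned, IntPtr.Zero))
            {
                throw new IOException("FSCTL_SET_SPARSE failed",
                    Marshal.GetLastWin32Error());
            }
        }
    }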

Wikipedia has an explanation of what a sparse file is.

The benefit of using a sparse file is that I don't have to care about what range my ids are in. If I only write ids between 2006000000 and 2010999999, the file will only allocate 625,000 bytes from offset 250,750,000 in the file. All space up to that offset is unallocated in the file system. Each id is stored as a set bit in the file; in effect it is treated as one big bit array. And if the id sequence suddenly changes, then it will allocate in another part of the file.

In order to retrieve which ids are set, I can perform an OS call to get the allocated parts of the sparse file, and then check each bit in those sequences. Also, checking whether a particular id is set is very fast. If it falls outside the allocated blocks, then it's not there; if it falls within, it's merely one byte read and a bit-mask check to see if the correct bit is set.
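
The allocated parts come back as offset/length pairs from FSCTL_QUERY_ALLOCATED_RANGES (0x000940CF in winioctl.h), and the single-id check is just arithmetic. A sketch, with names of my own choosing; note that reads from unallocated (sparse) regions simply return zero bytes:

    using System.IO.MemoryMappedFiles;
    using System.Runtime.InteropServices;

    // Matches FILE_ALLOCATED_RANGE_BUFFER in winioctl.h; an array of these
    // is returned by DeviceIoControl for FSCTL_QUERY_ALLOCATED_RANGES.
    [StructLayout(LayoutKind.Sequential)]
    struct FILE_ALLOCATED_RANGE_BUFFER
    {
        public long FileOffset;
        public long Length;
    }

    static class BitCheck
    {
        // The id itself is the bit index, so no base offset is needed.
        public static bool IsSet(MemoryMappedViewAccessor view, long itemId)
        {
            byte b = view.ReadByte(itemId / 8);
            return (b & (1 << (int)(itemId % 8))) != 0;
        }
    }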

So for the particular scenario where you have many ids which you want to check with as much speed as possible, this is the best way I've found so far.

And the good part is that the memory mapped files can be shared with Java as well (which turned out to be needed). Java also has support for memory mapped files on Windows, and implementing the read/write logic is fairly trivial.
