Data structure options for efficiently storing sets of integer pairs on disk?

Question

I have a bunch of code that deals with document clustering. One step involves calculating the similarity (for some unimportant definition of "similar") of every document to every other document in a given corpus, and storing the similarities for later use. The similarities are bucketed, and I don't care what the specific similarity is for purposes of my analysis, just what bucket it's in. For example, if documents 15378 and 3278 are 52% similar, the ordered pair (3278, 15378) gets stored in the [0.5,0.6) bucket. Documents sometimes get either added or removed from the corpus after initial analysis, so corresponding pairs get added to or removed from the buckets as needed.
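For concreteness, the bucketing step described above could be sketched like this (hypothetical helper names, not the asker's actual code; ordered pairs are stored smallest-ID-first, and a 52% similarity lands in the [0.5, 0.6) bucket):

```python
# Hypothetical sketch of the bucketing scheme: the bucket index is the
# similarity truncated to one decimal place, and the pair is ordered.
def bucket_pair(doc_a, doc_b, similarity):
    bucket = min(int(similarity * 10), 9)  # clamp so 1.0 goes in the top bucket
    return bucket, (min(doc_a, doc_b), max(doc_a, doc_b))

buckets = {i: set() for i in range(10)}
idx, pair = bucket_pair(15378, 3278, 0.52)
buckets[idx].add(pair)  # (3278, 15378) goes into bucket 5, i.e. [0.5, 0.6)
```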

I'm looking at strategies for storing these lists of ID pairs. We found a SQL database (where most of our other data for this project lives) to be too slow and too large disk-space-wise for our purposes, so at the moment we store each bucket as a compressed list of integers on disk (originally zlib-compressed, but now using lz4 instead for speed). Things I like about this:


  • Reading and writing are both quite fast

  • After-the-fact additions to the corpus are fairly simple to add (slightly less so for lz4 than for zlib, since lz4 doesn't have a built-in framing mechanism, but it's doable)

  • At both write and read time, the data can be streamed, so it doesn't all have to be held in memory at once, which would be prohibitive given the size of our corpora
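A minimal sketch of this on-disk format, using zlib from the standard library as a stand-in for lz4 (which needs a third-party binding); each bucket is a flat array of unsigned 32-bit ID pairs, compressed as a whole:

```python
import struct
import zlib

def write_bucket(path, pairs):
    # Pack each (a, b) pair as two little-endian uint32s, then compress.
    raw = b"".join(struct.pack("<II", a, b) for a, b in pairs)
    with open(path, "wb") as f:
        f.write(zlib.compress(raw))

def read_bucket(path):
    # Decompress and unpack 8 bytes at a time back into (a, b) tuples.
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    return [struct.unpack_from("<II", raw, off) for off in range(0, len(raw), 8)]
```

The real version streams through a compression context incrementally rather than materializing `raw`, but the wire format is the same idea.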

Things I don't like about it:


  • Deletion is a huge pain: it basically involves streaming through all the buckets and writing out new ones, omitting any pair that contains the ID of a deleted document

  • I suspect I could still do better, both speed-wise and compactness-wise, with a more purpose-built data structure and/or compression strategy
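The deletion pass in the first bullet amounts to a streaming filter over every bucket; a hedged sketch (an in-memory list stands in for the compressed stream):

```python
# Sketch of the rewrite-on-delete pass: stream a bucket's pairs and drop
# any pair that mentions a removed document ID. The real version would
# decompress the old bucket and recompress the survivors as it goes.
def rewrite_bucket(pairs, removed_ids):
    return [(a, b) for a, b in pairs
            if a not in removed_ids and b not in removed_ids]
```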

So: what kinds of data structures should I be looking at? I suspect that the right answer is some kind of exotic succinct data structure, but this isn't a space I know very well. Also, if it matters: all of the document IDs are unsigned 32-bit ints, and the current code that handles this data is written in C, as Python extensions, so that's probably the general technology family we'll stick with if possible.

Answer

Why not just store a table containing stuff that was deleted since the last re-write?

This table could be the same structure as your main bucket, maybe with a Bloom filter for quick membership checks.
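A toy Bloom filter along the lines suggested here (the sizes are arbitrary; a real one would size the bit array from the expected item count and target false-positive rate). A negative answer is definitive, so most streamed pairs skip the exact-table lookup entirely:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter for quick "was this pair deleted?" pre-checks."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions per item by salting a cryptographic hash.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "little") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True means "possibly present",
        # so a hit still needs confirmation against the exact table.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```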

You can re-write the main bucket data without the deleted items either when you were going to re-write it anyway for some other modification, or when the ratio of deleted items:bucket size exceeds some threshold.

This scheme could work either by storing each deleted pair alongside each bucket, or by storing a single table for all deleted documents: I'm not sure which is a better fit for your requirements.

With a single table, it's hard to know when you can remove an item unless you know how many buckets it affects, short of just re-writing all buckets whenever the deletion table gets too large. This could work, but it's a bit stop-the-world.

You also have to do two checks for each pair you stream in (i.e., for (3278, 15378), you'd check whether either 3278 or 15378 has been deleted, instead of just checking whether the pair (3278, 15378) has been deleted).

Conversely, the per-bucket table of each deleted pair would take longer to build, but be slightly faster to check, and easier to collapse when re-writing the bucket.
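To make the trade-off concrete, a rough sketch of both designs (hypothetical names, in-memory sets standing in for the on-disk tables): the global document-level table costs two checks per pair, while the per-bucket pair table is one check and collapses naturally when the bucket is re-written:

```python
# Design 1: one global table of deleted document IDs.
# Each streamed pair needs two membership checks, one per endpoint.
def live_doc_table(pair, deleted_docs):
    a, b = pair
    return a not in deleted_docs and b not in deleted_docs

# Design 2: a per-bucket table of deleted pairs ("tombstones").
# One check per pair, and the table is consumed by the next rewrite.
def live_pair_table(pair, tombstones):
    return pair not in tombstones

def compact_bucket(pairs, tombstones):
    # Re-write the bucket without tombstoned pairs; the emptied tombstone
    # set is what makes this design easy to collapse.
    live = [p for p in pairs if live_pair_table(p, tombstones)]
    return live, set()
```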
