C library for compressing sequential positive integers
Question
I have the very common problem of creating an index for an on-disk array of strings. In short, I need to store the position of each string in the on-disk representation. For example, a very naive solution would be an index array as follows:
uint64 idx[] = { 0, 20, 500, 1024, ..., 103434 };
This says that the first string is at position 0, the second at position 20, the third at position 500, and the nth at position 103434.
The positions are always non-negative 64-bit integers in sequential order. Although consecutive positions could differ by any amount, in practice I expect the typical difference to fall in the range from 2^8 to 2^20. I expect this index to be mmap'ed into memory, and the positions will be accessed randomly (assume a uniform distribution).
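To give a rough sense of the space at stake, the sketch below counts the bytes a gap would occupy in a variable-length byte encoding with 7 payload bits per byte. This encoding is an assumption chosen for illustration; the question does not commit to one, and `varint_len` is a hypothetical helper name.

```c
#include <stdint.h>

/* Hypothetical helper: bytes needed to store v when each byte carries
 * 7 payload bits plus a continuation bit (LEB128-style varint). */
unsigned varint_len(uint64_t v) {
    unsigned n = 1;
    while (v >= 0x80) {  /* more than 7 bits left: another byte is needed */
        v >>= 7;
        n++;
    }
    return n;
}
```

Under that assumption, a gap of 2^8 takes 2 bytes and a gap of 2^20 takes 3 bytes, compared with 8 bytes for every entry in the naive uint64 array.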
I was thinking about writing my own code to do some sort of block delta encoding or another more sophisticated encoding, but there are so many different trade-offs between encoding/decoding speed and space that I would rather get a working library as a starting point, and maybe even settle for something without any customization.
Any hints? A C library would be ideal, but a C++ one would also allow me to run some initial benchmarks.
A few more details, if you are still following. This will be used to build a library similar to cdb (http://cr.yp.to/cdb/cdbmake.html) on top of the cmph library (http://cmph.sf.net). In short, it is for a large, disk-based, read-only associative map with a small index in memory.
Since it is a library, I don't have control over the input, but the typical use case I want to optimize for has hundreds of millions of values, an average value size in the range of a few kilobytes, and a maximum value at 2^31.
For the record, if I don't find a ready-to-use library, I intend to implement delta encoding in blocks of 64 integers, with the initial bytes of each block specifying the offset reached so far. The blocks themselves would be indexed with a tree, giving me O(log(n/64)) access time. There are far too many other options and I would prefer not to discuss them. I am really looking for ready-to-use code rather than ideas on how to implement the encoding. I will be glad to share what I did with everyone once I have it working.
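For readers who want to see what such a fallback could look like, here is a minimal sketch of block delta encoding with 64 positions per block. It is not the asker's implementation and not taken from any library: the `delta_index` type and all function names are hypothetical, the deltas use a 7-bits-per-byte varint (an assumption), and a flat array of block offsets stands in for the tree mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK 64  /* positions per block, as proposed in the question */

/* Hypothetical container, not part of any existing library. */
typedef struct {
    uint8_t *bytes;      /* encoded blocks, stored back to back */
    size_t  *block_off;  /* byte offset of each block inside bytes */
    size_t   n;          /* number of positions stored */
} delta_index;

/* Write v as a varint (7 payload bits per byte); returns bytes written. */
size_t put_varint(uint8_t *out, uint64_t v) {
    size_t i = 0;
    while (v >= 0x80) { out[i++] = (uint8_t)(v | 0x80); v >>= 7; }
    out[i++] = (uint8_t)v;
    return i;
}

/* Read a varint into *v; returns bytes consumed. Assumes valid input. */
size_t get_varint(const uint8_t *in, uint64_t *v) {
    size_t i = 0;
    unsigned shift = 0;
    *v = 0;
    for (;;) {
        uint8_t b = in[i++];
        *v |= (uint64_t)(b & 0x7F) << shift;
        if (!(b & 0x80)) return i;
        shift += 7;
    }
}

/* Encode n non-decreasing positions. Returns 0 on success, -1 on failure. */
int delta_index_build(delta_index *ix, const uint64_t *pos, size_t n) {
    size_t nblocks = (n + BLOCK - 1) / BLOCK;
    /* Each position costs at most 10 bytes (8-byte base or 10-byte varint). */
    ix->bytes = malloc(n * 10 + 1);
    ix->block_off = malloc((nblocks + 1) * sizeof *ix->block_off);
    ix->n = n;
    if (!ix->bytes || !ix->block_off) {
        free(ix->bytes);
        free(ix->block_off);
        return -1;
    }
    size_t w = 0;
    for (size_t i = 0; i < n; i++) {
        if (i % BLOCK == 0) {
            /* Block header: the absolute position, in host byte order. */
            ix->block_off[i / BLOCK] = w;
            memcpy(ix->bytes + w, &pos[i], 8);
            w += 8;
        } else {
            assert(pos[i] >= pos[i - 1]);
            w += put_varint(ix->bytes + w, pos[i] - pos[i - 1]);
        }
    }
    return 0;
}

/* Random access: jump to the block, then sum at most 63 deltas. */
uint64_t delta_index_get(const delta_index *ix, size_t i) {
    assert(i < ix->n);
    const uint8_t *p = ix->bytes + ix->block_off[i / BLOCK];
    uint64_t v, d;
    memcpy(&v, p, 8);
    p += 8;
    for (size_t k = 0; k < i % BLOCK; k++) {
        p += get_varint(p, &d);
        v += d;
    }
    return v;
}

void delta_index_free(delta_index *ix) {
    free(ix->bytes);
    free(ix->block_off);
}
```

A caller would build the index once with `delta_index_build(&ix, positions, n)` and then read any entry with `delta_index_get(&ix, i)`. Because every block holds a fixed number of positions, the block for entry i is found directly as i / 64, so the flat offset array is enough here; a tree would only be needed if blocks were looked up by something other than the entry number. The block header is written in host byte order, so a file produced this way would not be portable across machines with different endianness.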
I appreciate your help. Let me know if you have any questions.
Recommended answer
I use fastbit (Kesheng Wu, LBL.GOV). It seems you need something good, fast, and available now, and fastbit is a highly competent improvement on Oracle's BBC (byte-aligned bitmap code, berkeleydb). It is easy to set up and generally very good.
However, given more time, you may want to look at a Gray code solution; it seems optimal for your purposes.
Daniel Lemire has released a number of libraries for C/C++/Java on code.google. I have read some of his papers and they are quite nice: several advancements on fastbit, and alternative approaches for column reordering with permuted Gray codes.
Almost forgot: I also came across Tokyo Cabinet. Although I do not think it will be well suited to my current project, I might have considered it more if I had known about it before ;). It has a large degree of interoperability:
Tokyo Cabinet is written in the C language, and provided as API of C, Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms which have API conforming to C99 and POSIX.
Since you referred to CDB: the TC benchmark has a TC mode (TC supports several operational constraints for varying performance) in which it surpassed CDB by 10 times for read performance and 2 times for write.
With respect to your delta encoding requirement, I am quite confident in bsdiff and its ability to outperform any file.exe content patching system. It may also have some fundamental interfaces for your general needs.
Google's new binary compression application, Courgette, may be worth checking out. In case you missed the press release, it produced diffs 10x smaller than bsdiff in the one test case I have seen published.