C library for compressing sequential positive integers


Question

I have the very common problem of creating an index for an on-disk array of strings. In short, I need to store the position of each string in the on-disk representation. For example, a very naive solution would be an index array as follows:

uint64 idx[] = { 0, 20, 500, 1024, ..., 103434 };

This says that the first string is at position 0, the second at position 20, the third at position 500, and the nth at position 103434.

The positions are always non-negative 64-bit integers in sequential order. Although consecutive positions could differ by any amount, in practice I expect the typical difference to fall in the range from 2^8 to 2^20. I expect this index to be mmap'ed in memory, and the positions will be accessed randomly (assume a uniform distribution).
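To put the expected range in concrete terms: a difference between 2^8 and 2^20 fits in 2 or 3 whole bytes, against 8 bytes for a raw 64-bit position. A minimal sketch of that arithmetic follows; delta_bytes is a hypothetical helper name, not part of any library.

```c
#include <stdint.h>

/* Hypothetical helper: smallest number of whole bytes that can hold
   the difference between two consecutive positions. */
unsigned delta_bytes(uint64_t delta) {
    unsigned n = 1;
    while (delta >>= 8) {
        n++;  /* one more byte for every additional 8 significant bits */
    }
    return n;
}
```

This only measures the payload; a real byte-oriented encoding also has to record each length somehow, for example with a continuation bit per byte.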

I was thinking about writing my own code for some sort of block delta encoding or another more sophisticated encoding, but there are so many trade-offs between encoding/decoding speed and space that I would rather get a working library as a starting point, and maybe even settle for something without any customization.

Any hints? A C library would be ideal, but a C++ one would also allow me to run some initial benchmarks.

A few more details if you are still following. This will be used to build a library similar to cdb (http://cr.yp.to/cdb/cdbmake.html) on top of the cmph library (http://cmph.sf.net). In short, it is for a large, disk-based, read-only associative map with a small index in memory.

Since it is a library, I don't have control over the input, but the typical use case that I want to optimize for has hundreds of millions of values, an average value size in the range of a few kilobytes, and a maximum value size of 2^31.

For the record, if I don't find a ready-to-use library, I intend to implement delta encoding in blocks of 64 integers, with the initial bytes of each block specifying the offset so far. The blocks themselves would be indexed with a tree, giving me O(log(n/64)) access time. There are far too many other options and I would prefer not to discuss them. I am really looking for ready-to-use code rather than ideas on how to implement the encoding. I will be glad to share what I did with everyone once I have it working.
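The block layout described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that was eventually written: the names (BLOCK, put_varint, get_varint, encode_blocks, lookup) are hypothetical, deltas are stored as base-128 varints, and a flat array of block start offsets stands in for the tree index to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 64 };  /* integers per block, as proposed above */

/* Write v as a little-endian base-128 varint; returns bytes written. */
size_t put_varint(uint8_t *out, uint64_t v) {
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);  /* low 7 bits plus continuation bit */
        v >>= 7;
    }
    out[n++] = (uint8_t)v;
    return n;
}

/* Decode one varint from in into *v; returns bytes consumed. */
size_t get_varint(const uint8_t *in, uint64_t *v) {
    size_t n = 0;
    unsigned shift = 0;
    *v = 0;
    for (;;) {
        uint8_t b = in[n++];
        *v |= (uint64_t)(b & 0x7f) << shift;
        if (!(b & 0x80)) break;
        shift += 7;
    }
    return n;
}

/* Encode n sorted positions. out needs room for 10 * n bytes in the worst
   case; block_start needs (n + BLOCK - 1) / BLOCK entries (assumption:
   a flat array instead of a tree). Returns the bytes written to out. */
size_t encode_blocks(const uint64_t *pos, size_t n,
                     uint8_t *out, uint64_t *block_start) {
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        if (i % BLOCK == 0) {
            /* New block: remember where it starts, then store the
               absolute position as 8 little-endian bytes. */
            block_start[i / BLOCK] = len;
            for (int b = 0; b < 8; b++)
                out[len++] = (uint8_t)(pos[i] >> (8 * b));
        } else {
            assert(pos[i] >= pos[i - 1]);  /* input must be sorted */
            len += put_varint(out + len, pos[i] - pos[i - 1]);
        }
    }
    return len;
}

/* Return position i by decoding at most BLOCK - 1 deltas. */
uint64_t lookup(const uint8_t *buf, const uint64_t *block_start, size_t i) {
    const uint8_t *p = buf + block_start[i / BLOCK];
    uint64_t value = 0;
    for (int b = 0; b < 8; b++)
        value |= (uint64_t)p[b] << (8 * b);
    p += 8;
    for (size_t k = 0; k < i % BLOCK; k++) {
        uint64_t delta;
        p += get_varint(p, &delta);
        value += delta;
    }
    return value;
}
```

With the flat array, locating a block is a single array access and the remaining cost of a random lookup is the sequential decoding of up to 63 deltas. A smaller block size trades space for faster lookups.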

I appreciate your help; let me know if you have any questions.

Recommended answer

I use fastbit (Kesheng Wu, LBL.GOV). It seems you need something good, fast, and NOW, so fastbit is a highly competent improvement on Oracle's BBC (byte-aligned bitmap code, berkeleydb). It's easy to set up and very good generally.

However, given more time, you may want to look at a Gray code solution; it seems optimal for your purposes.
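The answer does not say how a Gray code would be applied to the index, so the following is only an illustration of the standard binary-reflected Gray code, in which consecutive integers differ in exactly one bit. The function names to_gray and from_gray are hypothetical.

```c
#include <stdint.h>

/* Binary-reflected Gray code of n. */
uint64_t to_gray(uint64_t n) {
    return n ^ (n >> 1);
}

/* Inverse of to_gray: XOR each bit with all higher bits, in
   log2(64) = 6 folding steps. */
uint64_t from_gray(uint64_t g) {
    for (unsigned shift = 1; shift < 64; shift <<= 1)
        g ^= g >> shift;
    return g;
}
```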

Daniel Lemire has a number of libraries for C/C++/Java released on code.google. I've read over some of his papers and they are quite nice, with several advancements on fastbit and alternative approaches for column re-ordering with permuted Gray codes.

Almost forgot: I also came across Tokyo Cabinet. Though I do not think it will be well suited for my current project, I might have considered it more if I had known about it before ;). It has a large degree of interoperability:

Tokyo Cabinet is written in the C language, and provided as API of C, Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms which have API conforming to C99 and POSIX.

As you referred to CDB: the TC benchmark has a TC mode (TC supports several operational constraints for varying performance) in which it surpassed CDB by 10 times for read performance and 2 times for write performance.

With respect to your delta encoding requirement, I am quite confident in bsdiff and its ability to out-perform any file.exe content patching system; it may also have some fundamental interfaces for your general needs.

Google's new binary compression application, Courgette, may be worth checking out in case you missed the press release: 10x smaller diffs than bsdiff in the one test case I have seen published.
