C Library for compressing sequential positive integers


Problem description



I have the very common problem of creating an index for an on-disk array of strings. In short, I need to store the position of each string in the on-disk representation. For example, a very naive solution would be an index array as follows:

uint64 idx[] = { 0, 20, 500, 1024, ..., 103434 };

This says that the first string is at position 0, the second at position 20, the third at position 500, and the nth at position 103434.

The positions are always non-negative 64-bit integers in sequential order. Although the numbers could vary by any difference, in practice I expect the typical difference to be in the range from 2^8 to 2^20. I expect this index to be mmap'ed into memory, and the positions to be accessed randomly (assume a uniform distribution).
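As a baseline for that layout, here is a minimal sketch of mapping a raw on-disk `uint64` index read-only with POSIX `mmap`, so a lookup is just an array access (the function name is illustrative, not from any library mentioned here):

```c
#include <fcntl.h>
#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map an on-disk array of n uint64_t positions read-only.
 * Returns NULL on failure. Lookup is then just pos[i]:
 * 8 bytes per entry, no decoding -- the uncompressed baseline. */
uint64_t *index_map(const char *path, size_t n) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    void *m = mmap(NULL, n * sizeof(uint64_t),
                   PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping keeps the file referenced */
    return m == MAP_FAILED ? NULL : (uint64_t *)m;
}
```

Any compressed scheme has to beat this on space while staying close to its O(1) random-access cost.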

I was thinking about writing my own code for doing some sort of block delta encoding or other more sophisticated encoding, but there are so many different trade-offs between encoding/decoding speed and space that I would rather get a working library as a starting point and maybe even settle for something without any customizations.

Any hints? A C library would be ideal, but a C++ one would also allow me to run some initial benchmarks.

A few more details if you are still following. This will be used to build a library similar to cdb (http://cr.yp.to/cdb/cdbmake.html) on top of the cmph library (http://cmph.sf.net). In short, it is for a large disk-based read-only associative map with a small in-memory index.

Since it is a library, I don't have control over the input, but the typical use case I want to optimize for has millions of values, an average value size in the few-kilobytes range, and a maximum value of 2^31.

For the record, if I don't find a ready-to-use library, I intend to implement delta encoding in blocks of 64 integers, with the initial bytes of each block specifying the block offset so far. The blocks themselves would be indexed with a tree, giving me O(log(n/64)) access time. There are way too many other options, and I would prefer not to discuss them; I am really looking for ready-to-use code rather than ideas on how to implement the encoding. I will be glad to share with everyone what I did once I have it working.
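The block scheme described above can be sketched as follows. This is only an illustration under simplifying assumptions: fixed 32-bit deltas rather than a variable-length code, and no tree over the block bases (the function names are made up for the example):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 64  /* positions per block, as in the scheme above */

/* Encode up to BLOCK positions: the first position is stored verbatim
 * as the block's base offset, followed by one 32-bit delta per
 * remaining position (enough for gaps up to 2^32; a real encoder
 * would use a variable-length code). Returns bytes written. */
size_t block_encode(const uint64_t *pos, size_t n, uint8_t *out) {
    uint8_t *p = out;
    memcpy(p, &pos[0], sizeof(uint64_t));  /* block base offset */
    p += sizeof(uint64_t);
    for (size_t i = 1; i < n; i++) {
        uint32_t d = (uint32_t)(pos[i] - pos[i - 1]);
        memcpy(p, &d, sizeof d);
        p += sizeof d;
    }
    return (size_t)(p - out);
}

/* Decode position i (0-based) inside one encoded block by summing
 * the base offset and the first i deltas. */
uint64_t block_decode(const uint8_t *blk, size_t i) {
    uint64_t v;
    memcpy(&v, blk, sizeof v);
    const uint8_t *p = blk + sizeof v;
    for (size_t k = 0; k < i; k++) {
        uint32_t d;
        memcpy(&d, p, sizeof d);
        v += d;
        p += sizeof d;
    }
    return v;
}
```

With the block bases kept sorted, a binary search (or tree) over them finds the right block in O(log(n/64)); decoding then scans at most 63 deltas.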

I appreciate your help, and let me know if you have any questions.

Solution

I use fastbit (Kesheng Wu, LBL.GOV). It seems you need something good, fast, and available now, and fastbit is a highly competent improvement on Oracle's BBC (byte-aligned bitmap code, BerkeleyDB). It's easy to set up and very good generally.

However, given more time, you may want to look at a Gray-code solution; it seems optimal for your purposes.

Daniel Lemire has a number of libraries for C/C++/Java released on code.google. I've read over some of his papers and they are quite nice: several advancements on fastbit and alternative approaches for column re-ordering with permuted Gray codes.

Almost forgot: I also came across Tokyo Cabinet. Though I do not think it will be well suited for my current project, I might have considered it more if I had known about it before ;). It has a large degree of interoperability:

Tokyo Cabinet is written in the C language, and provided as API of C, Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms which have API conforming to C99 and POSIX.

As you referred to CDB, the TC benchmark has a TC mode (TC supports several operational constraints for varying performance) where it surpassed CDB by 10 times for read performance and 2 times for write performance.

With respect to your delta-encoding requirement, I am quite confident in bsdiff and its ability to outperform any file.exe content-patching system; it may also have some fundamental interfaces for your general needs.

Google's new binary compression application, courgette, may be worth checking out, in case you missed the press release: 10x smaller diffs than bsdiff in the one test case I have seen published.
