Java: fast disk-based hash set


Problem description

I need to store a big hash set, able to contain up to approx. 200 million 40-bit values. Storing it as 200 million 64-bit values would be acceptable (despite the 200 million * 24 bits of waste).
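
For reference, a 40-bit value occupies the low five bytes of a Java long, so a disk format can avoid the 24 wasted bits per entry by packing each value into 5 bytes. A minimal sketch (plain bit arithmetic, not from the original post):

import java.nio.ByteBuffer;

public class FortyBitCodec {
    private static final long MASK_40 = (1L << 40) - 1;

    /** Write the low 40 bits of value as 5 big-endian bytes. */
    public static void pack(long value, ByteBuffer out) {
        long v = value & MASK_40;
        for (int shift = 32; shift >= 0; shift -= 8) {
            out.put((byte) (v >>> shift));
        }
    }

    /** Read 5 big-endian bytes back into a non-negative long. */
    public static long unpack(ByteBuffer in) {
        long v = 0;
        for (int i = 0; i < 5; i++) {
            v = (v << 8) | (in.get() & 0xFF);
        }
        return v;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(5);
        pack(0xAB_CDEF_0123L, buf);
        buf.flip();
        System.out.println(Long.toHexString(unpack(buf))); // prints abcdef0123
    }
}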

The requirements are:


  • tiny memory footprint (disk space ain't an issue, memory is)

  • fast contains(long l) and add(long l) methods (much faster than SQL; see the interface sketch after this list)

  • embedded

  • free and without nasty licensing (no Berkeley DB); LGPL is fine

  • no false positives and no false negatives, so things like disk-based Bloom filters are not what I'm after

  • SQL is not what I'm after here
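
To make the target concrete, here is a minimal sketch of the API being asked for; the interface name is hypothetical, and only contains(long) and add(long) come from the question itself:

/** Hypothetical interface for the disk-backed set described above. */
public interface DiskLongSet extends AutoCloseable {

    /** Adds the (40-bit) value; returns true if it was not already present. */
    boolean add(long l);

    /** Returns true if the value was added before; no false positives or negatives. */
    boolean contains(long l);

    @Override
    void close() throws Exception;
}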

Because I really think I'm more after something fast like this (notice how that solution is much faster than a SQL solution):

Fast disk-based hash tables?

Does Google have such a Java API?

Would a fast disk-based key/value pair implementation where I'd only use the 'key' work?

Or anything else?

I'd rather not re-invent this.

Recommended answer

Try sg-cdb (the Strange Gizmo port of djb's cdb), and then swap out the RandomAccessFile for a BufferedRandomAccessFile (there's a good one in the jai-imageio code).
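
A minimal sketch of how that could look. The class and method names below follow the cdb-style make/find API of the Strange Gizmo port as I understand it; treat the exact signatures as assumptions and check the sg-cdb javadoc:

import com.strangegizmo.cdb.Cdb;
import com.strangegizmo.cdb.CdbMake;

public class CdbSetSketch {

    /** Encode the low 40 bits of a value as a 5-byte big-endian key. */
    static byte[] key(long value) {
        byte[] k = new byte[5];
        for (int i = 0; i < 5; i++) {
            k[i] = (byte) (value >>> (8 * (4 - i)));
        }
        return k;
    }

    public static void main(String[] args) throws Exception {
        // Build phase: a cdb file is written once, then becomes read-only.
        CdbMake make = new CdbMake();
        make.start("values.cdb");
        make.add(key(123456789L), new byte[0]); // empty value: only the key matters for a set
        make.finish();

        // Query phase: membership test = "does the key exist?".
        Cdb cdb = new Cdb("values.cdb");
        System.out.println(cdb.find(key(123456789L)) != null); // true
    }
}

One design point to keep in mind: cdb is a constant database, so all add calls happen in a build phase that ends with finish(), after which the file is read-only. That matches the buffered build-then-query workload described below, but not interleaved add/contains.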

I'm getting mind-blowing write performance, through the roof (because it's all buffered and then committed in one fell swoop). Read performance for the buffered RAF is unchanged, though.

I could take the time to compare with H2 bulk import; it might be comparable, though I'm not sure.

The database is simple: key => value, and lookup is only supported on the key. The keys in my case are (base32-encoded random int + base32-encoded unique int), so locality shouldn't really help much. The values are arrays of 120 random bytes.

May 4, 2011 11:45:10 PM test.TestH2Simple main: inserts performed, added in: 101625 ms

database size: about 140 MB

batch size: 2000: inserts performed, added in: 116875 ms

batch size: 10000: inserts performed, added in: 70234 ms

with RandomAccessFile: insert without flush: 21688 ms, flush: 30359 ms, total: 52047 ms; size of file on disk: 66.1 MB (69,315,632 bytes)

with BufferedRandomAccessFile: about 6.5 seconds

Of course, it's not really a fair comparison, since H2 flushes data continuously during the insert, whereas sg-cdb does not. That has to be borne in mind when making the comparison. A fairer test would probably compare sg-cdb insert against H2 bulk insert.
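
For reference, the H2 side of the timings above is the usual JDBC batch-insert pattern. A sketch under assumptions: the table and column names are made up here, and only the 120-byte values and the batch sizes (2000 vs 10000) come from the answer:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class H2BatchInsert {
    public static void main(String[] args) throws Exception {
        // Embedded H2 database file; the URL and schema are illustrative.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./test")) {
            conn.setAutoCommit(false);
            conn.createStatement().execute(
                    "CREATE TABLE IF NOT EXISTS kv(k VARCHAR PRIMARY KEY, v BINARY(120))");

            int batchSize = 10_000; // the knob varied in the timings above
            try (PreparedStatement ps =
                         conn.prepareStatement("INSERT INTO kv(k, v) VALUES(?, ?)")) {
                for (int i = 0; i < 1_000_000; i++) {
                    ps.setString(1, "key-" + i);
                    ps.setBytes(2, new byte[120]); // stand-in for the 120 random bytes
                    ps.addBatch();
                    if ((i + 1) % batchSize == 0) {
                        ps.executeBatch(); // flush one batch to the database
                    }
                }
                ps.executeBatch(); // flush the remainder
            }
            conn.commit();
        }
    }
}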

: searched: 490000, finished on: 1304544550439, took: 17547 ms, counter: 0

: selects performed in: 19579 ms

Concerning memory-mapped files: they don't seem to be what you're looking for. The great performance with mmap'd files comes when you map about 100 MB or more into memory (in my experience).
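
For completeness, memory mapping in Java goes through FileChannel.map; a minimal sketch of the approach being ruled out here (standard NIO, not from the answer):

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
    public static void main(String[] args) throws Exception {
        long size = 8L * 1024 * 1024; // map 8 MB of the file into memory
        try (FileChannel ch = FileChannel.open(Path.of("data.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
            buf.putLong(0, 42L);                // writes go straight into the mapping
            System.out.println(buf.getLong(0)); // reads come from the page cache
            buf.force();                        // flush dirty pages to disk
        }
    }
}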

