Binary search in a sorted (memory-mapped?) file in Java


Question

I am struggling to port a Perl program to Java, and learning Java as I go. A central component of the original program is a Perl module that does string prefix lookups in a +500 GB sorted text file using binary search (essentially, "seek" to a byte offset in the middle of the file, backtrack to nearest newline, compare line prefix with the search string, "seek" to half/double that byte offset, repeat until found...)
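The seek/backtrack/compare loop described above can be sketched in Java with java.io.RandomAccessFile. This is a minimal sketch rather than a port of the Perl module: the class and method names (`PrefixSearch`, `findLineWithPrefix`) are invented here, and it assumes a single-byte encoding with newline-terminated, lexicographically sorted lines.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class PrefixSearch {
    // Returns a line starting with the given prefix, or null if none is found.
    // Assumes sorted, newline-terminated lines in a single-byte encoding.
    static String findLineWithPrefix(RandomAccessFile f, String prefix) throws IOException {
        long lo = 0, hi = f.length();          // lo and hi are always line-start offsets
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            // Backtrack from mid to the start of the line containing it.
            long lineStart = mid;
            while (lineStart > 0) {
                f.seek(lineStart - 1);
                if (f.read() == '\n') break;
                lineStart--;
            }
            f.seek(lineStart);
            String line = f.readLine();        // terminator is stripped
            if (line == null) break;
            if (line.startsWith(prefix)) return line;
            if (line.compareTo(prefix) < 0)
                lo = lineStart + line.length() + 1;  // continue in the upper half
            else
                hi = lineStart;                      // continue in the lower half
        }
        return null;
    }
}
```

Because `lo` and `hi` only ever land on line starts, each iteration discards at least the line just examined, so the loop terminates in O(log n) line reads.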

I have experimented with several database solutions but found that nothing beats this in sheer lookup speed with data sets of this size. Do you know of any existing Java library that implements such functionality? Failing that, could you point me to some idiomatic example code that does random access reads in text files?

Alternatively, I am not familiar with the new (?) Java I/O libraries but would it be an option to memory-map the 500 GB text file (I'm on a 64-bit machine with memory to spare) and do binary search on the memory-mapped byte array? I would be very interested to hear any experiences you have to share about this and similar problems.
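For reference, memory mapping lives in java.nio: FileChannel.map returns a MappedByteBuffer you can index into directly. A minimal sketch of the backtrack-to-newline step on a mapped buffer (assuming a file under 2 GB, since a single MappedByteBuffer is capped at Integer.MAX_VALUE bytes; `MappedRead` and `lineAt` are invented names):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedRead {
    // Maps the whole file (assumed < 2 GB here) and returns the line
    // containing the given byte offset, backtracking to the previous newline.
    static String lineAt(Path path, long offset) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            int start = (int) offset;
            while (start > 0 && buf.get(start - 1) != '\n') start--;   // backtrack
            int end = (int) offset;
            while (end < buf.limit() && buf.get(end) != '\n') end++;   // find line end
            byte[] bytes = new byte[end - start];
            buf.position(start);
            buf.get(bytes);
            return new String(bytes, StandardCharsets.US_ASCII);
        }
    }
}
```

Past the 2 GB mark the single-buffer approach breaks down, which is exactly the limitation the paging code in the answer works around.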

Answer

I am a big fan of Java's MappedByteBuffers for situations like this. It is blazing fast. Below is a snippet I put together for you that maps a buffer to the file, seeks to the middle, and then searches backwards to a newline character. This should be enough to get you going?

I have similar code (seek, read, repeat until done) in my own application. I benchmarked java.io streams against MappedByteBuffer in a production environment and posted the results on my blog (Geekomatic posts tagged 'java.nio'), with raw data, graphs and all.

Two second summary? My MappedByteBuffer-based implementation was about 275% faster. YMMV.

To work with files larger than ~2 GB, which is a problem because of the int cast and .position(int pos), I crafted a paging algorithm backed by an array of MappedByteBuffers. You'll need to be on a 64-bit system for this to work with files larger than 2-4 GB, because MBBs use the OS's virtual memory system to work their magic.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

import static java.nio.channels.FileChannel.MapMode.READ_ONLY;

public class StusMagicLargeFileReader {
    private static final long PAGE_SIZE = Integer.MAX_VALUE;
    private final List<MappedByteBuffer> buffers = new ArrayList<MappedByteBuffer>();
    private final byte[] raw = new byte[1];

    public static void main(String[] args) throws IOException {
        File file = new File("/Users/stu/test.txt");
        FileChannel fc = new FileInputStream(file).getChannel();
        StusMagicLargeFileReader buffer = new StusMagicLargeFileReader(fc);
        long position = file.length() / 2;
        String candidate = buffer.getString(position--);
        // Scan backwards until we hit a newline or the start of the file.
        while (position >= 0 && !candidate.equals("\n"))
            candidate = buffer.getString(position--);
        // Have the newline position (or start of file)... do other stuff.
    }

    StusMagicLargeFileReader(FileChannel channel) throws IOException {
        // Map the file as a sequence of pages, each at most Integer.MAX_VALUE bytes.
        for (long start = 0; start < channel.size(); start += PAGE_SIZE) {
            long length = Math.min(PAGE_SIZE, channel.size() - start);
            buffers.add(channel.map(READ_ONLY, start, length));
        }
    }

    public String getString(long bytePosition) {
        // Translate the 64-bit file offset into a (page, offset-within-page) pair.
        int page  = (int) (bytePosition / PAGE_SIZE);
        int index = (int) (bytePosition % PAGE_SIZE);
        raw[0] = buffers.get(page).get(index);
        return new String(raw); // one byte per call; assumes a single-byte encoding
    }
}
