How can I create and access an index to go to a specific position in a big file in Java?

Question

I have a large file with the following format:

Unique String \t Information

In my program I need to read this file to get the Information through the Unique String key. Since performance is important, I can't scan every line looking for the key each time, and I can't load the file into memory because it is too large. So I'd like to read the file only once and build an index from each String key to its position (in bytes) in the file. This index would be something like a HashMap whose key is the Unique String and whose value is the byte offset in the file where the key appears.

It seems that RandomAccessFile could do this, but I don't know how.

So, how can I build this index and then access a specific line through it?

Recommended answer

The approach I suggest is to read the file once and keep track of the position as you go. Store the starting position of each line in a map, keyed by the line's key, so you can look it up later.

The first way to do this is to treat your file as a DataInput and use RandomAccessFile#readLine:

RandomAccessFile raf = new RandomAccessFile("filename.txt", "r");
Map<String, Long> index = new HashMap<>();

Now, how is your data stored? If it is stored line by line, and the encoding is one that DataInput#readLine can handle (it turns each byte into one character), then you can use the following:

long start = raf.getFilePointer();
String line = raf.readLine();
String key = extractKeyFromLine(line);
index.put(key, start);
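
The snippet above calls extractKeyFromLine without defining it. A minimal sketch of such a helper, assuming the "Unique String \t Information" layout from the question (the class name LineKeys is hypothetical, and falling back to the whole line when there is no tab is an assumption, not something the answer specifies):

```java
// Hypothetical helper: the answer uses extractKeyFromLine but does not define it.
final class LineKeys {
    private LineKeys() {}

    /** Returns the text before the first tab, or the whole line if it has no tab. */
    static String extractKeyFromLine(String line) {
        int tab = line.indexOf('\t');
        return tab < 0 ? line : line.substring(0, tab);
    }
}
```

Using indexOf and substring rather than split avoids building an array for every line, which may matter when the file has many lines.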

Then, any time you need to go back and get the data:

long position = index.get(key);
raf.seek(position);
String line = raf.readLine();
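
One thing to watch in the lookup above: index.get(key) returns null for a key that was never indexed, and unboxing that null into a long throws a NullPointerException. A small sketch of a lookup that handles the missing-key case (the class and method names are hypothetical, and returning null for an unknown key is just one possible convention):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;

final class IndexLookup {
    private IndexLookup() {}

    /** Returns the line stored for the key, or null if the key is not in the index. */
    static String lookup(RandomAccessFile raf, Map<String, Long> index, String key)
            throws IOException {
        Long position = index.get(key); // keep it boxed so a missing key can be detected
        if (position == null) {
            return null;
        }
        raf.seek(position);             // jump straight to the start of the line
        return raf.readLine();
    }
}
```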

Here is a complete example:

package helloworld;

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

/**
 * Created by matt on 07/02/2017.
 */
public class IndexedFileAccess {
    // The key is everything before the first ':' on the line.
    static String getKey(String line) {
        return line.split(":")[0];
    }

    public static void main(String[] args) throws IOException {
        Map<String, Long> index = new HashMap<>();
        // try-with-resources closes the file even if an exception is thrown.
        try (RandomAccessFile file = new RandomAccessFile("junk.txt", "r")) {
            // Populate the index: record where each line starts before reading it.
            long start = file.getFilePointer();
            String s;
            while ((s = file.readLine()) != null) {
                index.put(getKey(s), start);
                start = file.getFilePointer();
            }

            // Look up every key: seek to its stored offset and re-read the line.
            for (Map.Entry<String, Long> entry : index.entrySet()) {
                System.out.printf("key %s has a pos of %s\n", entry.getKey(), entry.getValue());
                file.seek(entry.getValue());
                System.out.println(file.readLine());
            }
        }
    }
}

junk.txt contains:

dog:1, 2, 3
cat:4, 5, 6
zebra: p, z, t

And the output is:

key zebra has a pos of 24
zebra: p, z, t
key cat has a pos of 12
cat:4, 5, 6
key dog has a pos of 0
dog:1, 2, 3

There are several caveats to this. RandomAccessFile#readLine turns each byte into one character, so it only decodes single-byte text correctly. If you need a more robust encoding, then the first time you read the file you'll want a reader that can manage the encoding while you still track byte positions, and use the RandomAccessFile only as a seekable source of bytes. readLine() can also become a problem if the lines are very large, because it builds each whole line in memory one byte at a time. In those cases you would have to devise your own strategy for extracting the key/data pair.
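
As one possible strategy for the encoding caveat, the sketch below counts raw bytes while scanning and decodes each line explicitly. It assumes the file is UTF-8, uses '\n' (optionally preceded by '\r') as the line terminator, and follows the tab-separated layout from the question; the class and method names are hypothetical and this is not the answer author's code.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

final class Utf8LineIndex {
    private Utf8LineIndex() {}

    /** Scans the file once and maps each key to the byte offset where its line starts. */
    static Map<String, Long> build(Path file) throws IOException {
        Map<String, Long> index = new HashMap<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            ByteArrayOutputStream line = new ByteArrayOutputStream();
            long lineStart = 0; // offset of the first byte of the current line
            long offset = 0;    // number of bytes consumed so far
            int b;
            while ((b = in.read()) != -1) {
                offset++;
                if (b == '\n') {
                    addEntry(index, line, lineStart);
                    line.reset();
                    lineStart = offset;
                } else {
                    line.write(b);
                }
            }
            addEntry(index, line, lineStart); // last line may lack a trailing newline
        }
        return index;
    }

    /** Seeks to a stored offset and decodes the line found there as UTF-8. */
    static String readLineAt(RandomAccessFile raf, long position) throws IOException {
        raf.seek(position);
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        // Unbuffered byte reads: acceptable for one short line per lookup.
        while ((b = raf.read()) != -1 && b != '\n') {
            line.write(b);
        }
        return decode(line);
    }

    private static void addEntry(Map<String, Long> index, ByteArrayOutputStream line, long start) {
        String text = decode(line);
        if (text.isEmpty()) {
            return; // skip blank lines
        }
        int tab = text.indexOf('\t');
        index.put(tab < 0 ? text : text.substring(0, tab), start);
    }

    private static String decode(ByteArrayOutputStream line) {
        String text = new String(line.toByteArray(), StandardCharsets.UTF_8);
        return text.endsWith("\r") ? text.substring(0, text.length() - 1) : text;
    }
}
```

Splitting on the raw '\n' byte is safe for UTF-8 because that byte value never occurs inside a multi-byte sequence; other encodings (UTF-16, for example) would need a different scanner.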
