Fast way to read nth line from file

Question

Introduction

I have a C++ process called MyProcess that I call nbLines times, where nbLines is the number of lines in a big file called InputDataFile.txt that holds the input data. For example, the call

./MyProcess InputDataFile.txt 142

tells MyProcess that the input data are to be found on line 142 of the InputDataFile.txt file.

Issue

The issue is that InputDataFile.txt is so big (~150 GB) that the time spent searching for the correct line is not negligible. Inspired by this post, here is my (possibly not optimal) code:

int line = 142;
int N = line - 1;
std::ifstream inputDataFile(filename.c_str());
std::string inputData;
for(int i = 0; i < N; ++i)
    std::getline(inputDataFile, inputData);

std::getline(inputDataFile,inputData);

Goal

My goal is to make the search for inputData faster for MyProcess.

Possible solution

It would be handy to compute once, in bash, a mapping from each line number to the index of the first character of that line. This way, instead of giving 142 to MyProcess, I could directly give the index of the first character of interest. MyProcess could then jump straight to that position without having to search for and count the '\n' characters. It would then read the data until a '\n' character is encountered. Is something like this feasible? How could it be implemented?
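The "jump straight to that position" step could look roughly like the sketch below. The helper name readLineAt is hypothetical, and the sketch assumes the byte offset of the line start is already known and counted from 0.

```cpp
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the line that starts at byte `offset`
// (counted from 0) of the file at `path`.
std::string readLineAt(const std::string& path, std::uint64_t offset) {
    // Binary mode so that byte offsets are not altered by newline translation.
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    // Jump directly to the start of the line; no '\n' counting needed.
    in.seekg(static_cast<std::streamoff>(offset));
    std::string line;
    // Read up to (and discard) the next '\n'.
    if (!std::getline(in, line)) {
        throw std::runtime_error("offset is past the end of " + path);
    }
    return line;
}
```

With this, MyProcess would receive the offset on the command line instead of the line number, and the cost of a lookup no longer depends on how far into the file the line sits.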

Of course, I welcome any other solution that would reduce the overall computational time for importing those input data.

Recommended answer

As suggested in other answers, it could be a good idea to build a map of the file. The way I would do this (in pseudocode) would be:

let offset be an unsigned 64-bit int = 0

for each line in the file
    read the line
    write offset to a binary file (as 8 raw bytes rather than as chars)
    offset += length of line in bytes (including the trailing '\n')

Now you have a "map" file that is a list of 64-bit ints (one for each line in the file). To read the map, you just compute where in the map the entry for the line you want is located:

offset = desired_line_number * 8 // where line numbers start at 0
offset2 = (desired_line_number+1) * 8

data_position1 = load bytes [offset through offset + 7] as a 64-bit int from map
data_position2 = load bytes [offset2 through offset2 + 7] as a 64-bit int from map

data = load bytes [data_position1 through data_position2-1] as a string from the data file

The idea is that you read through the data file once, record the byte offset at which each line starts, and store those offsets sequentially in a binary file using a fixed-size integer type. The map file should then have a size of number_of_lines * sizeof(integer_type_used). To look up a line, you seek into the map file to the position where that line's offset is stored, then read that offset as well as the next line's offset. Those two numbers give you the byte range in the data file where your line is located.
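The size relationship also works in reverse: the number of entries can be recovered from the size of the map file. A minimal sketch, assuming 8-byte entries and a hypothetical helper name entriesInMap:

```cpp
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: number of 8-byte offsets stored in the map file,
// i.e. map_file_size / sizeof(integer_type_used).
std::uint64_t entriesInMap(const std::string& mapPath) {
    // Open positioned at the end so tellg() reports the file size.
    std::ifstream map(mapPath, std::ios::binary | std::ios::ate);
    if (!map) {
        throw std::runtime_error("cannot open " + mapPath);
    }
    const std::streamoff size = map.tellg();
    if (size < 0) {
        throw std::runtime_error("cannot determine size of " + mapPath);
    }
    const auto bytes = static_cast<std::uint64_t>(size);
    // A size that is not a multiple of 8 means the map is truncated or corrupt.
    if (bytes % sizeof(std::uint64_t) != 0) {
        throw std::runtime_error("map file size is not a multiple of 8");
    }
    return bytes / sizeof(std::uint64_t);
}
```

This gives a cheap way to validate a requested line number before seeking, and doubles as a line count for the data file.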

Example:

Data:

hello\n
world\n
(\n newline at end of file)

Create the map.

Map: each grouping [number] represents one 8-byte entry in the map file

[0][6][12]
//or in binary
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000110
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00001100

Now say I want line 2:

line offset = (2-1) * 8 // offset is 8

Since we are counting from 0, offset 8 is the 9th byte of the map file. So our number is made up of bytes 8 through 15 (the 9th through 16th bytes), which are:

00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000110
//or as decimal
6

So now we know that our line starts at offset 6 in the data file (counting from 0, so it is the 7th byte).

We then do the same process to get the start offset of the next line, which is 12.

Finally we read the byte range 6-11 (counting from 0) from the data file, store it as a string, and get world\n.
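As a quick check of the arithmetic, the sketch below computes line-start offsets for a string in memory. It assumes offsets are counted from 0 and that each line ends with a single-byte '\n'; the function name lineStartOffsets is hypothetical.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: 0-based byte offset of the start of every line in `data`.
std::vector<std::uint64_t> lineStartOffsets(const std::string& data) {
    std::istringstream in(data);
    std::vector<std::uint64_t> offsets;
    std::uint64_t offset = 0;
    std::string line;
    while (std::getline(in, line)) {
        // Record where this line began.
        offsets.push_back(offset);
        // getline strips the '\n', so add 1 to land on the next line's first byte.
        offset += line.size() + 1;
    }
    return offsets;
}
```

For the sample data, lineStartOffsets("hello\nworld\n\n") yields 0, 6, 12: "hello\n" occupies bytes 0-5, "world\n" occupies bytes 6-11, and the final empty line starts at byte 12.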

C++ implementation:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>

int main(int argc, const char * argv[]) {
    std::string filename = "path/to/input.txt";
    std::string mapFilename = "path/to/map/file.bin";

    std::ifstream inputFile(filename.c_str(), std::ios::binary);
    std::ofstream outfile(mapFilename.c_str(), std::ios::binary);

    if (!inputFile.is_open() || !outfile.is_open()) {
        //use better error handling than this
        throw std::runtime_error("Error opening files");
    }

    std::string inputData;
    //fixed 8-byte type, so the map layout does not depend on the platform's size_t
    std::uint64_t offset = 0;
    while (std::getline(inputFile, inputData)) {
        //write the start offset of this line as raw binary
        outfile.write(reinterpret_cast<const char*>(&offset), sizeof(offset));
        //advance to the start of the next line: add one because getline strips the \n
        //(in binary mode a '\r' stays in the string, so it is already counted)
        offset += inputData.length() + 1;
    }
    outfile.close();

    offset = 0;

    //from here on we are reading the map (the same file that was written above)
    std::ifstream inmap(mapFilename.c_str(), std::ios::binary);
    std::size_t line = 2; //your chosen line number (counting from 1)
    std::size_t idx = (line - 1) * sizeof(offset); //position of that line's entry in the map
    //seek into the map
    inmap.seekg(idx);
    //read the binary at that location
    inmap.read(reinterpret_cast<char*>(&offset), sizeof(offset));
    std::cout << offset << std::endl;

    //from here you just need to look up the line in the data file in the same manner

    return 0;
}
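The last step, which the code above leaves as a comment, might be sketched as follows. The helper name readLineViaMap is hypothetical, and the map is assumed to hold 8-byte offsets counted from 0, written on the same machine (same byte order) that reads them. Instead of fetching the next line's offset, this sketch reads up to the next '\n' with getline, which avoids needing an extra map entry after the last line.

```cpp
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: fetch line `lineNo` (counting from 1) of the data file
// by looking up its start offset in the map file.
std::string readLineViaMap(const std::string& dataPath,
                           const std::string& mapPath,
                           std::uint64_t lineNo) {
    if (lineNo == 0) {
        throw std::invalid_argument("line numbers start at 1");
    }
    std::ifstream map(mapPath, std::ios::binary);
    std::ifstream data(dataPath, std::ios::binary);
    if (!map || !data) {
        throw std::runtime_error("Error opening files");
    }

    // Each map entry is 8 bytes, so entry k lives at byte (k - 1) * 8.
    std::uint64_t start = 0;
    map.seekg(static_cast<std::streamoff>((lineNo - 1) * sizeof(start)));
    if (!map.read(reinterpret_cast<char*>(&start), sizeof(start))) {
        throw std::out_of_range("line number is not in the map");
    }

    // Jump to the line start in the data file and read up to the next '\n'.
    data.seekg(static_cast<std::streamoff>(start));
    std::string line;
    if (!std::getline(data, line)) {
        throw std::runtime_error("map does not match the data file");
    }
    return line;
}
```

Each lookup then costs two seeks and two small reads, regardless of where in the 150 GB file the line sits. The map only has to be rebuilt when the data file changes.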
