Efficiently reading a very large text file in C++


Problem description

I have a very large text file (45GB). Each line of the text file contains two space-separated 64-bit unsigned integers, as shown below.

4624996948753406865 10214715013130414417
4305027007407867230 4569406367070518418
10817905656952544704 3697712211731468838
...

I want to read the file and perform some operations on the numbers.

#include <fstream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>

using namespace std;

void process_data(string str)
{
    vector<string> arr;
    boost::split(arr, str, boost::is_any_of(" \n"));
    do_some_operation(arr);
}

int main()
{
    unsigned long long int read_bytes = 45 * 1024 *1024;
    const char* fname = "input.txt";
    ifstream fin(fname, ios::in);
    char* memblock;

    while(!fin.eof())
    {
        memblock = new char[read_bytes];
        fin.read(memblock, read_bytes);
        string str(memblock);
        process_data(str);
        delete [] memblock;
    }
    return 0;
}

I am relatively new to C++. When I run this code, I run into the following problems:

  1. Because the file is read in fixed-size byte blocks, the last line of a block sometimes corresponds to an unfinished line in the original file ("4624996948753406865 10214" instead of the actual string "4624996948753406865 10214715013130414417" in the main file).

  2. This code runs very, very slowly. It takes around 6 seconds to process one block on a 64-bit Intel Core i7 920 system with 6GB of RAM. Are there any optimization techniques I can use to improve the runtime?

  3. Is it necessary to include "\n" along with the blank character in the boost split function?

  4. I have read about memory-mapping files in C++, but I am not sure whether that's the correct way to do it. If yes, please attach some links.

Answer

I'd redesign this to work in a streaming fashion, instead of on blocks.

The simplest approach would be:

std::ifstream ifs("input.txt");
std::vector<uint64_t> parsed(std::istream_iterator<uint64_t>(ifs), {});

If you know roughly how many values to expect, using std::vector::reserve up front could speed things up further.
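
For reference, a complete version of that could look like the sketch below. This is a minimal sketch, not the answer's own code: the file name and the reserve count are illustrative assumptions (a 45GB file at roughly 40 bytes per line holds on the order of 2 billion values, and reserving for them commits about 16GiB of RAM up front).

#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    std::ifstream ifs("input.txt");

    std::vector<uint64_t> parsed;
    // Hypothetical size estimate; adjust to your data. Reserving 2.2
    // billion 8-byte values commits roughly 16GiB of address space.
    parsed.reserve(2200000000ull);

    // Stream-extract whitespace-separated integers until EOF. The
    // pre-reserved capacity is kept, so reallocation is avoided as
    // long as the estimate is not exceeded.
    parsed.assign(std::istream_iterator<uint64_t>(ifs),
                  std::istream_iterator<uint64_t>());

    std::cout << "values read: " << parsed.size() << "\n";
}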

Alternatively, you can use a memory-mapped file and iterate over the character sequence.
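
As a rough illustration of that alternative, here is a minimal sketch assuming a POSIX system and C++17 (for the bounded std::from_chars); input.txt stands in for the real file, and error handling is kept to the bare minimum.

#include <charconv>
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

int main() {
    int fd = open("input.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // Map the whole file read-only; the kernel pages it in on demand.
    char* base = static_cast<char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    std::vector<uint64_t> data;
    const char* p = base;
    const char* end = base + st.st_size;
    while (p < end) {
        if (*p == ' ' || *p == '\n' || *p == '\r') { ++p; continue; } // skip separators
        uint64_t v = 0;
        auto res = std::from_chars(p, end, v); // bounded: never reads past the mapping
        if (res.ec != std::errc()) break;      // stop at the first unexpected byte
        data.push_back(v);
        p = res.ptr;
    }

    std::printf("parsed %zu values\n", data.size());
    munmap(base, st.st_size);
    close(fd);
}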

Update: I modified the program above to parse uint32_ts into a vector, this time reading the input from a memory-mapped file.

When using a sample input file of 4.5GiB[1], the program runs in 9 seconds[2]:

sehe@desktop:/tmp$ make -B && sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" ./test smaller.txt
g++ -std=c++0x -Wall -pedantic -g -O2 -march=native test.cpp -o test -lboost_system -lboost_iostreams -ltcmalloc
parse success
trailing unparsed: '
'
data.size():   402653184
0:08.96 elapsed, 6 context switches

Of course it allocates at least 402653184 * 4 bytes ≈ 1.5GiB. So when you read a 45GB file, you will need an estimated 15GiB of RAM just to store the vector (assuming no fragmentation on reallocation). The 45GiB parse completes in 10min 45s:

make && sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" ./test 45gib_uint32s.txt 
make: Nothing to be done for `all'.
tcmalloc: large alloc 17570324480 bytes == 0x2cb6000 @  0x7ffe6b81dd9c 0x7ffe6b83dae9 0x401320 0x7ffe6af4cec5 0x40176f (nil)
Parse success
Trailing unparsed: 1 characters
Data.size():   4026531840
Time taken by parsing: 644.64s
10:45.96 elapsed, 42 context switches

By comparison, just running wc -l 45gib_uint32s.txt took ~12 minutes (without realtime priority scheduling, though). wc is blazingly fast.

#include <boost/spirit/include/qi.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

namespace qi = boost::spirit::qi;

typedef std::vector<uint32_t> data_t;

using hrclock = std::chrono::high_resolution_clock;

int main(int argc, char** argv) {
    if (argc<2) return 255;
    data_t data;
    data.reserve(4392580288);   // for the 45 GiB file benchmark
    // data.reserve(402653184); // for the 4.5 GiB file benchmark

    boost::iostreams::mapped_file mmap(argv[1], boost::iostreams::mapped_file::readonly);
    auto f = mmap.const_data();
    auto l = f + mmap.size();

    using namespace qi;

    auto start_parse = hrclock::now();
    bool ok = phrase_parse(f,l,int_parser<uint32_t, 10>() % eol, blank, data);
    auto stop_time = hrclock::now();

    if (ok)
        std::cout << "Parse success\n";
    else
        std::cerr << "Parse failed at #" << std::distance(mmap.const_data(), f)
                  << " around '" << std::string(f, f + 50) << "'\n";

    if (f != l)
        std::cerr << "Trailing unparsed: " << std::distance(f, l) << " characters\n";

    std::cout << "Data.size():   " << data.size() << "\n";
    std::cout << "Time taken by parsing: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(stop_time - start_parse).count() / 1000.0
              << "s\n";
}
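
Since the question's data is actually two 64-bit values per line, the program above could be adapted along the following lines. This is an unbenchmarked sketch: qi::uint_parser is instantiated for uint64_t, and the skipper is widened from blank to all whitespace, so the per-line pair structure is not enforced and values simply arrive in file order.

#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/spirit/include/qi.hpp>
#include <cstdint>
#include <iostream>
#include <vector>

namespace qi = boost::spirit::qi;

int main(int argc, char** argv) {
    if (argc < 2) return 255;

    boost::iostreams::mapped_file mmap(argv[1], boost::iostreams::mapped_file::readonly);
    auto f = mmap.const_data();
    auto l = f + mmap.size();

    std::vector<uint64_t> data;
    // The Kleene star parses any number of integers; spaces and
    // newlines are skipped alike by the qi::space skipper.
    bool ok = qi::phrase_parse(f, l, *qi::uint_parser<uint64_t, 10>(), qi::space, data);

    // Consecutive values pair up: (data[2*i], data[2*i + 1]) came from line i.
    std::cout << (ok && f == l ? "parse success" : "parse incomplete")
              << ", values: " << data.size() << "\n";
}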


[1] generated with od -t u4 /dev/urandom -A none -v -w4 | pv | dd bs=1M count=$((9*1024/2)) iflag=fullblock > smaller.txt

[2] Obviously, this was with the file cached in the buffer cache on Linux; the large file doesn't have this benefit.

