Efficiently reading a very large text file in C++


Question


I have a very large text file(45GB). Each line of the text file contains two space separated 64bit unsigned integers as shown below.

4624996948753406865 10214715013130414417
4305027007407867230 4569406367070518418
10817905656952544704 3697712211731468838
...


I want to read the file and perform some operations on the numbers.

#include <fstream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>

using namespace std;

void do_some_operation(vector<string>& arr); // defined elsewhere

void process_data(string str)
{
    vector<string> arr;
    boost::split(arr, str, boost::is_any_of(" \n"));
    do_some_operation(arr);
}

int main()
{
    unsigned long long int read_bytes = 45 * 1024 *1024;
    const char* fname = "input.txt";
    ifstream fin(fname, ios::in);
    char* memblock;

    while(!fin.eof())
    {
        memblock = new char[read_bytes];
        fin.read(memblock, read_bytes);
        string str(memblock);
        process_data(str);
        delete [] memblock;
    }
    return 0;
}

I am relatively new to C++. When I run this code, I run into these problems.


  1. Because of reading the file in byte blocks, sometimes the last line of a block is an unfinished line from the original file ("4624996948753406865 10214" instead of the actual string "4624996948753406865 10214715013130414417" in the main file).

  2. This code runs very, very slowly. It takes around 6 seconds per block operation on a 64-bit Intel Core i7 920 system with 6GB of RAM. Are there any optimization techniques I can use to improve the runtime?

  3. Is it necessary to include "\n" along with the blank character in the boost split function?

  4. I have read about mmap-ing files in C++, but I am not sure whether it's the correct way to do so. If yes, please attach some links.
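
Problem 1 above (a block ending mid-line) is usually handled by carrying the unfinished tail of each block over into the next read. A minimal sketch, not from the original post; the function returns complete lines so each one is then safe to hand to boost::split:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read a stream in fixed-size blocks, carrying any trailing partial line
// over to the next block so that only complete lines are ever returned.
std::vector<std::string> read_complete_lines(std::istream& in, std::size_t block_size) {
    std::vector<std::string> lines;
    std::string carry;                        // unfinished line from the previous block
    std::vector<char> buf(block_size);
    while (in) {
        in.read(buf.data(), buf.size());
        std::string chunk = carry + std::string(buf.data(), in.gcount());
        std::size_t last_nl = chunk.rfind('\n');
        if (last_nl == std::string::npos) {   // no complete line in this chunk yet
            carry = chunk;
            continue;
        }
        carry = chunk.substr(last_nl + 1);    // keep the partial tail for next round
        std::istringstream complete(chunk.substr(0, last_nl));
        for (std::string line; std::getline(complete, line); )
            lines.push_back(line);
    }
    if (!carry.empty())
        lines.push_back(carry);               // last line had no trailing '\n'
    return lines;
}
```

For a real 45GB file you would keep processing each batch of lines as it is produced rather than accumulating them all, but the boundary handling is the same.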

Answer

I'd redesign this to work in streaming fashion, instead of on blocks.

A simple approach is:

std::ifstream ifs("input.txt");
std::vector<uint64_t> parsed(std::istream_iterator<uint64_t>(ifs), {});

If you know roughly how many values to expect, using std::vector::reserve up front could speed it up further.

Alternatively, you can use a memory-mapped file and iterate over the character sequence.


  • How to parse space separated floats in C++ quickly? (http://stackoverflow.com/questions/17465061/how-to-parse-space-separated-floats-in-c-quickly/17479702#17479702) shows these approaches, with benchmarks, for floats.

Update: I modified the above program to parse uint32_ts into a vector.

When using a sample input file of 4.5GiB[1], the program runs in 9 seconds[2]:

sehe@desktop:/tmp$ make -B && sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" ./test smaller.txt
g++ -std=c++0x -Wall -pedantic -g -O2 -march=native test.cpp -o test -lboost_system -lboost_iostreams -ltcmalloc
parse success
trailing unparsed: '
'
data.size():   402653184
0:08.96 elapsed, 6 context switches

Of course it allocates at least 402653184 * 4 bytes = 1.5 GiB. So when
you read a 45 GB file, you will need an estimated 15 GiB of RAM just to
store the vector (assuming no fragmentation on reallocation). The 45 GiB
parse completes in 10m 45s:

make && sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" ./test 45gib_uint32s.txt 
make: Nothing to be done for `all'.
tcmalloc: large alloc 17570324480 bytes == 0x2cb6000 @  0x7ffe6b81dd9c 0x7ffe6b83dae9 0x401320 0x7ffe6af4cec5 0x40176f (nil)
Parse success
Trailing unparsed: 1 characters
Data.size():   4026531840
Time taken by parsing: 644.64s
10:45.96 elapsed, 42 context switches

By comparison, just running wc -l 45gib_uint32s.txt took ~12 minutes (without realtime priority scheduling, though), and wc is blazingly fast.

#include <boost/spirit/include/qi.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

namespace qi = boost::spirit::qi;

typedef std::vector<uint32_t> data_t;

using hrclock = std::chrono::high_resolution_clock;

int main(int argc, char** argv) {
    if (argc<2) return 255;
    data_t data;
    data.reserve(4392580288);   // for the  45 GiB file benchmark
    // data.reserve(402653184); // for the 4.5 GiB file benchmark

    boost::iostreams::mapped_file mmap(argv[1], boost::iostreams::mapped_file::readonly);
    auto f = mmap.const_data();
    auto l = f + mmap.size();

    using namespace qi;

    auto start_parse = hrclock::now();
    bool ok = phrase_parse(f,l,int_parser<uint32_t, 10>() % eol, blank, data);
    auto stop_time = hrclock::now();

    if (ok)   
        std::cout << "Parse success\n";
    else 
        std::cerr << "Parse failed at #" << std::distance(mmap.const_data(), f) << " around '" << std::string(f,f+50) << "'\n";

    if (f!=l) 
        std::cerr << "Trailing unparsed: " << std::distance(f,l) << " characters\n";

    std::cout << "Data.size():   " << data.size() << "\n";
    std::cout << "Time taken by parsing: " << std::chrono::duration_cast<std::chrono::milliseconds>(stop_time-start_parse).count() / 1000.0 << "s\n";
}


[1] od -t u4 /dev/urandom -A none -v -w4 | pv | dd bs=1M count=$((9*1024/2)) iflag=fullblock > smaller.txt

[2] Obviously, this was with the file cached in the buffer cache on Linux; the large file doesn't have this benefit.
