Simplest way to read a CSV file mapped to memory?


Question

When I read from files in C++(11) I map them into memory using:

boost::interprocess::file_mapping* fm = new boost::interprocess::file_mapping(path, boost::interprocess::read_only);
boost::interprocess::mapped_region* region = new boost::interprocess::mapped_region(*fm, boost::interprocess::read_only);
char* bytes = static_cast<char*>(region->get_address());

Which is fine when I wish to read byte by byte extremely fast. However, I have created a csv file which I would like to map to memory, read each line and split each line on the comma.

Is there a way I can do this with a few modifications of my above code?

(I am mapping to memory because I have an awful lot of memory and I do not want any bottleneck with disk/IO streaming).

Recommended answer

Here's my take on "fast enough". It zips through 116 MiB of CSV (2.5 million lines [1]) in ~1 second.

The result is then randomly accessible at zero-copy, so no overhead (unless pages are swapped out).

For comparison:

  • that's ~3x faster than a naive wc csv.txt takes on the same file
  • it's about as fast as the following perl one liner (which lists the distinct field counts on all lines):

perl -ne '$fields{scalar split /,/}++; END { map { print "$_\n" } keys %fields  }' csv.txt


  

  • it's only slower than (LANG=C wc csv.txt) which avoids locale functionality (by about 1.5x)
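
For readers who don't use Perl, the job the one-liner does (collecting the distinct per-line field counts) can be sketched in C++ over the same kind of in-memory byte range. This is a minimal sketch, not part of the original answer: `distinct_field_counts` is a hypothetical helper name, it treats `\n` as the only line terminator, and edge cases may differ from Perl's `split`.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical helper: returns the set of distinct field counts seen per line.
// Assumes '\n' terminates lines and ',' separates fields (no quoting support).
std::set<std::size_t> distinct_field_counts(const char* first, const char* last) {
    assert(first <= last);
    std::set<std::size_t> counts;
    std::size_t fields = 1;              // a line always has at least one field
    for (const char* p = first; p != last; ++p) {
        if (*p == ',') {
            ++fields;                    // each comma starts another field
        } else if (*p == '\n') {
            counts.insert(fields);       // line finished: record its field count
            fields = 1;
        }
    }
    // Record a final line that is not terminated by a newline.
    if (first != last && last[-1] != '\n') counts.insert(fields);
    return counts;
}
```

Called on the bytes of a mapped file (for example `csv.data()` and `csv.data() + csv.size()`), it never copies or stores any field data.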

    Here's the parser in all its glory:

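    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/phoenix.hpp>
    #include <boost/utility/string_ref.hpp>
    #include <vector>

    namespace qi = boost::spirit::qi;

    using CsvField = boost::string_ref;
    using CsvLine  = std::vector<CsvField>;
    using CsvFile  = std::vector<CsvLine>;  // keep it simple :)
    
    struct CsvParser : qi::grammar<char const*, CsvFile()> {
        CsvParser() : CsvParser::base_type(lines)
        {
            using namespace qi;
            using boost::phoenix::construct;
            using boost::phoenix::begin;
            using boost::phoenix::size;
    
            // a field is any run of characters other than ',' or a line break
            field = raw [*~char_(",\r\n")] 
                [ _val = construct<CsvField>(begin(_1), size(_1)) ]; // semantic action
            line  = field % ',';   // fields separated by commas
            lines = line  % eol;   // lines separated by end-of-line
        }
      private:
        qi::rule<char const*, CsvField()> field;
        qi::rule<char const*, CsvLine()>  line;
        qi::rule<char const*, CsvFile()>  lines;
    };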
    

    The only tricky thing (and the only optimization there) is the semantic action that constructs a CsvField from the source iterator and the number of characters matched.
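
To make the zero-copy idea concrete without Spirit, here is a hand-written sketch of the same three rules. It is not the answer's code: `FieldView` is a hypothetical stand-in for `boost::string_ref` (a pointer plus a length, owning nothing), and `split_csv` is an illustrative name. Like the grammar, it does no quote handling.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for boost::string_ref: points into the source buffer.
struct FieldView {
    const char* data;
    std::size_t size;
    std::string str() const { return std::string(data, size); } // copies only on request
};

using Line = std::vector<FieldView>;
using File = std::vector<Line>;

// Mirrors: field = *~char_(",\r\n"); line = field % ','; lines = line % eol
// A trailing line break therefore yields a last line holding one empty field.
File split_csv(const char* first, const char* last) {
    assert(first <= last);
    File lines;
    Line current;
    const char* field_begin = first;
    for (const char* p = first; ; ++p) {
        const bool at_end = (p == last);
        if (at_end || *p == ',' || *p == '\r' || *p == '\n') {
            // Same job as the semantic action: start pointer plus matched length.
            current.push_back(FieldView{field_begin,
                                        static_cast<std::size_t>(p - field_begin)});
            if (at_end || *p != ',') {           // end of input or a line break
                lines.push_back(current);
                current.clear();
                if (at_end) break;
                if (*p == '\r' && p + 1 != last && p[1] == '\n') ++p; // treat CRLF as one break
            }
            field_begin = p + 1;
        }
    }
    return lines;
}
```

Every `FieldView` points straight into the input range, so the views stay valid only while the underlying mapping (or buffer) is alive; that is the price of not copying.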

    Here's main:

    #include <boost/iostreams/device/mapped_file.hpp>
    #include <iostream>

    int main()
    {
        // map the file read-only; data() points at the mapped bytes
        boost::iostreams::mapped_file_source csv("csv.txt");
    
        CsvFile parsed;
        if (qi::parse(csv.data(), csv.data() + csv.size(), CsvParser(), parsed))
        {
            std::cout << (csv.size() >> 20) << " MiB parsed into " << parsed.size() << " lines of CSV field values\n";
        }
    }
    

    Prints

    116 MiB parsed into 2578421 lines of CSV field values
    

    You can use the values just like std::string:

    for (int i = 0; i < 10; ++i)
    {
        auto l     = rand() % parsed.size();
        auto& line = parsed[l];
        auto c     = rand() % line.size();
    
        std::cout << "Random field at L:" << l << "\t C:" << c << "\t" << line[c] << "\n";
    }
    

    Which prints, for example:

    Random field at L:1979500    C:2    sateen's
    Random field at L:928192     C:1    sackcloth's
    Random field at L:1570275    C:4    accompanist's
    Random field at L:479916     C:2    apparel's
    Random field at L:767709     C:0    pinks
    Random field at L:1174430    C:4    axioms
    Random field at L:1209371    C:4    wants
    Random field at L:2183367    C:1    Klondikes
    Random field at L:2142220    C:1    Anthony
    Random field at L:1680066    C:2    pines
    

    The fully working sample is here: Live On Coliru

    [1] I created the file by repeatedly appending the output of

    while read a && read b && read c && read d && read e
    do echo "$a,$b,$c,$d,$e"
    done < /etc/dictionaries-common/words
    

    to csv.txt, until it counted 2.5 million lines.
