使用boost ::输入输出流:: mapped_file_source用的std :: multimap中 [英] Using boost::iostreams::mapped_file_source with std::multimap

查看:275
本文介绍了使用boost ::输入输出流:: mapped_file_source用的std :: multimap中的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个相当大的数据量,分析 - 每个文件为约5gigs。每个文件的格式如下:

I have a rather large amount of data to analyse - each file is about 5gigs. Each file is of the following format:

xxxxx yyyyy

两个键和值可以重复,但是键递增顺序排序。我试图使用内存映射文件用于此目的,然后找到所需的钥匙,并与他们合作。这是我写的:

Both key and value can repeat, but the keys are sorted in increasing order. I'm trying to use a memory mapped file for this purpose and then find the required keys and work with them. This is what I've written:

if (data_file != "")
{
    clock_start = clock();
    data_file_mapped.open(data_file);

    data_multimap = (std::multimap<double, unsigned int> *)data_file_mapped.data();
    if (data_multimap != NULL)
    {
        std::multimap<double, unsigned int>::iterator it = data_multimap->find(keys_to_find[4]);
        if (it != data_multimap->end())
        {
            std::cout << "Element found.";
            for (std::multimap<double, unsigned int>::iterator it = data_multimap->lower_bound(keys_to_find[4]); it != data_multimap->upper_bound(keys_to_find[5]); ++it)
            {
                std::cout << it->second;
            }
            std::cout << "\n";
            clock_end = clock();

            std::cout << "Time taken to read in the file: " << (clock_end - clock_start)/CLOCKS_PER_SEC << "\n";
        }
        else
            std::cerr << "Element not found at all" << "\n";
    }
    else
        std::cerr << "Nope - no data received."<< "\n";
}

基本上,我需要找到关键点的范围和那些拉出来大块上下工夫。我收到了段错误,我第一次尝试使用方法上的多重映射。例如,当找到方法被调用。我试过 UPPER_BOUND LOWER_BOUND 等方法过了,仍然得到段错误。

Basically, I need to locate ranges of keys and pull those chunks out to work on. I get a segfault the first time I try to use a method on the multimap. For example, when the find method is called. I tried the upper_bound, lower_bound and other methods too, and still get a segfault.

这是什么 GDB 给我:

Program received signal SIGSEGV, Segmentation fault.
_M_lower_bound (this=<optimized out>, __k=<optimized out>, __y=<optimized out>, __x=0xa31202030303833) at /usr/include/c++/4.9.2/bits/stl_tree.h:1261
1261            if (!_M_impl._M_key_compare(_S_key(__x), __k))

可能有人请指出我在做什么错?我只能够找到内存映射文件简单的例子 - 这样的事没有。谢谢你。

Could someone please point out what I'm doing wrong? I've only been able to find simplistic examples on memory mapped files - nothing like this yet. Thanks.

编辑:更多信息:

我上述文件是基本上是一个神经模拟器让我为我的模拟输出的两列纯文本文件。很简单是这样的:

The file I described above is basically a two column plain text file which a neural simulator gives me as the output of my simulations. It's simple like this:

$ du -hsc 201501271755.e.ras 
4.9G    201501271755.e.ras
4.9G    total
$ head 201501271755.e.ras 
0.013800  0
0.013800  1
0.013800  10
0.013800  11
0.013800  12
0.013800  13
0.013800  14
0.013800  15
0.013800  16
0.013800  17

第一列是时间,第二列是,在这个时候解雇神经元 - (这是一个尖峰时间光栅文件)。实际上,输出是从正在使用运行仿真每个MPI等级这样的文件。各种文件已被合并使用排序-g -m 这个主文件。关于文件格式的更多信息是在这里: HTTP://www.fzenke。净/ auryn / doku.php ID =手动:RAS

The first column is time, the second column is the neurons that fired at this time - (it's a spike time raster file). Actually, the output is a file like this from each MPI rank that is being used to run the simulation. The various files have been combined to this master file using sort -g -m. More information on the file format is here: http://www.fzenke.net/auryn/doku.php?id=manual:ras

要计算发射率,并设置在模拟中的某些时段神经元的其他指标,我需要 - 找到该文件中的时间,拉出[-1时,时间]之间的一大块,并运行一些指标和所以就这个块。此文件是相当小的,我希望的大小,增加了不少我的模拟变得越来越复杂,运行更长的时间段。这就是为什么我开始寻找到内存映射文件。我希望有些澄清问题陈述。我只需要读取输出文件来处理它所包含的信息。我并不需要在所有修改此文件。

To calculate the firing rate and other metrics of the neuron set at certain times of the simulation, I need to - locate the time in the file, pull out a chunk between [time -1,time] and run some metrics and so on on this chunk. This file is quite small and I expect the size to increase quite a bit as my simulations get more and more complicated and run for longer time periods. It's why I began looking into memory mapped files. I hope that clarifies the problem statement somewhat. I only need to read the output file to process the information it contains. I do not need to modify this file at all.

要处理的数据,我会再次使用多线程,但因为我在地图上的所有操作都是只读的,我不希望遇到麻烦那里。

To process the data, I'll use multiple threads again, but since all my operations on the map are read-only, I don't expect to run into trouble there.

推荐答案

您正在尝试的东西,你真的不明白:)没有问题。

You're trying things you don't really understand :) No problem.

多图不是在内存布局顺序。 (他们是基于节点的容器,但我离题)。事实上,即使他们是,机会是微乎其微的布局将匹配文本输入。

Multi maps aren't laid out sequentially in memory. (They're node-based containers, but I digress). In fact, even if they were, chances would be slim that the layout would match that of the text input.

有基本上是两种方法可以使这项工作:

There's basically two ways you can make this work:


  1. 请使用多重映射,但使用自定义分配器(让所有的分配是在映射的内存区域完成)。这是从一个高层次的C中的最好的++的观点,/但/你需要更改你的文件的二进制格式。

  1. Keep using the multimap but use a custom allocator (so that all allocations are done in the mapped memory region). This is the "nicest" from a high-level C++ viewpoint, /but/ you will need to change to a binary format of your file.

如果可以,这是我建议。提高集装箱+升压进程间有你需要使这个相对无痛的一切。

If you can, this is what I'd suggest. Boost Container + Boost Interprocess have everything you need to make this relatively painless.

您编写自定义容器抽象是直接作用于映射的数据。你既可以

You write a custom container "abstraction" that works directly on the mapped data. You could either


  • 从任何地方识别XXXX YYYY对(行结束?)或

  • 所有建行的索引开始的文件中。

使用这些你可以设计一个迭代符(升压的Iterator iterator_facade ),您可以用它来实现更高级别的操作( LOWER_BOUND UPPER_BOUND equal_range )。

Using these you can devise an interator (Boost Iterator iterator_facade) that you can use to implement higher level operations (lower_bound, upper_bound and equal_range).

一旦你有了这些,你基本上所有设置查询该存储器映射作为一个只读的键值数据库。

Once you have these, you're basically all set to query this memory map as a readonly key-value database.

不幸的是,这种记忆重新presentation会的非常的糟糕表现,如果你也想支持变异操作(插入删除)。

Sadly, this kind of memory representation would be extremely bad for performance if you also want to support mutating operations (insert, remove).

如果您有该文件的实际样品,我可以做的任何描述的方法的演示。

If you have an actual sample of the file, I could do a demonstration of either of the approaches described.

快速样品:


  1. 使用的boost ::进程间可以(非常)简单地定义你想要的多重映射:

  1. With boost::interprocess you can (very) simply define the multimap you desire:

namespace shared {
    namespace bc = boost::container;

    template <typename T> using allocator = bip::allocator<T, bip::managed_mapped_file::segment_manager>;
    template <typename K, typename V>
        using multi_map = bc::flat_multimap<
            K, V, std::less<K>, 
            allocator<typename bc::flat_multimap<K, V>::value_type> >;
}

注:


  • 我选择了 flatmap flat_multimap ,实际上),因为它很可能更多
    存储效率,并且更加媲美的第二种方法
    (如下所示);

  • I chose flatmap (flat_multimap, actually) because it is likely more storage efficient, and is much more comparable to the second approach (given below);

请注意,这个选择会影响迭代器/基准的稳定性和意志
主张只读操作pretty严重。如果您需要迭代
稳定性和/或多个变异操作,使用常规的地图(或
非常高的音量一个的hash_map )代替的的变化。

Note that this choice affects iterator/reference stability and will favours read-only operations pretty heavily. If you need iterator stability and/or many mutating operations, use a regular map (or for very high volumes a hash_map) instead of the flat variations.

我选择了一个 managed_mapped_file 段为该演示(所以你得到持久性)。该演示展示了10G是如何疏pre分配,但只有实际分配的空间的使用的磁盘上。你同样可以使用 managed_shared_memory

I chose a managed_mapped_file segment for this demonstration (so you get persistence). The demo shows how 10G is sparsely pre-allocated, but only the space actually allocated is used on disk. You could equally well use a managed_shared_memory.

如果你有二进制持久性,您的可能的完全丢弃的文本数据文件。

If you have binary persistence, you might discard the text datafile altogether.

我分析文本数据转换成共享:: multi_map&LT;双,无符号&GT; mapped_file_source 使用Boost精神。实现完全通用的。

I parse the text data into a shared::multi_map<double, unsigned> from a mapped_file_source using Boost Spirit. The implementation is fully generic.

有没有必要写迭代器类, start_of_line() end_of_line() LOWER_BOUND() UPPER_BOUND() equal_range()或任何这些,因为他们是在 multi_map 接口已经标准,所以我们需要的就是写

There is no need to write iterator classes, start_of_line(), end_of_line(), lower_bound(), upper_bound(), equal_range() or any of those, since they're already standard in the multi_map interface, so all we need to is write main:

<大骨节病> 住在Coliru

#define NDEBUG
#undef DEBUG
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/container/flat_map.hpp>
#include <boost/interprocess/managed_mapped_file.hpp>
#include <boost/spirit/include/qi.hpp>
#include <iomanip>

namespace bip = boost::interprocess;
namespace qi = boost::spirit::qi;

namespace shared {
    namespace bc = boost::container;

    template <typename T> using allocator = bip::allocator<T, bip::managed_mapped_file::segment_manager>;
    template <typename K, typename V>
        using multi_map = bc::flat_multimap<
            K, V, std::less<K>, 
            allocator<typename bc::flat_multimap<K, V>::value_type> >;
}

#include <iostream>

bip::managed_mapped_file msm(bip::open_or_create, "lookup.bin", 10ul<<30);

template <typename K, typename V>
shared::multi_map<K,V>& get_or_load(const char* fname) {
    using Map = shared::multi_map<K, V>;
    Map* lookup = msm.find_or_construct<Map>("lookup")(msm.get_segment_manager());

    if (lookup->empty()) { 
        // only read input file if not already loaded
        boost::iostreams::mapped_file_source input(fname);
        auto f(input.data()), l(f + input.size());

        bool ok = qi::phrase_parse(f, l,
                (qi::auto_ >> qi::auto_) % qi::eol >> *qi::eol, 
                qi::blank, *lookup);

        if (!ok || (f!=l))
            throw std::runtime_error("Error during parsing at position #" + std::to_string(f - input.data()));
    }

    return *lookup;
}

int main() {
    // parse text file into shared memory binary representation
    auto const& lookup = get_or_load<double, unsigned int>("input.txt");
    auto const e = lookup.end();

    for(auto&& line : lookup)
    {
        std::cout << line.first << "\t" << line.second << "\n";

        auto er = lookup.equal_range(line.first);

        if (er.first  != e) std::cout << " lower: " << er.first->first  << "\t" << er.first->second  << "\n";
        if (er.second != e) std::cout << " upper: " << er.second->first << "\t" << er.second->second << "\n";
    }
}


  • 我实现了它完全按照我描述的:

  • I implemented it exactly as I described:


    • 遍生为const char * 区域简单的容器映射;

    • 使用的boost :: iterator_facade 来作出这样的分析上取消引用文本上的迭代;

    • 打印我用的boost :: string_ref 输入线 - 这避免了动态分配的字符串复制

    • 解析与精神齐完成的:

    • simple container over the raw const char* region mapped;
    • using boost::iterator_facade to make an iterator that parses the text on dereference;
    • for printing the input lines I use boost::string_ref - which avoids dynamic allocations for copying strings.
    • parsing is done with Spirit Qi:

    if (!qi::phrase_parse(
                b, _data.end,
                qi::auto_ >> qi::auto_ >> qi::eoi,
                qi::space,
                _data.key, _data.value)) 
    

    齐被选为速度和通用性:可以选择类型的实例时间:

    text_multi_lookup<double, unsigned int> tml(map.data(), map.data() + map.size());
    


  • 我实现了 LOWER_BOUND UPPER_BOUND equal_range 那些利用底层连续的存储成员函数。尽管行迭代器不是随机访问,但双向的,我们仍然可以跳转到 mid_point 这样的迭代器的范围,因为我们可以得到 start_of_line 任何为const char * 到下面的映射区域。这使二进制搜索效率。

  • I've implemented lower_bound, upper_bound and equal_range member functions that take advantage of underlying contiguous storage. Even though the "line" iterator is not random-access but bidirectional, we can still jump to the mid_point of such an iterator range because we can get the start_of_line from any const char* into the underlying mapped region. This make binary searching efficient.

    请注意,此解决方案解析上的迭代器的提领线。如果同样的线路被解除引用了很多次这可能不是有效的。

    Note that this solution parses lines on dereference of the iterator. This might not be efficient if the same lines are dereferenced a lot of times.

    但是,对于不频繁的查询,或者不在所述输入数据的相同区域典型查找,这大约是有效的,因为它可以有可能得到(做仅最小所需解析和 O(日志N)二进制搜索),而同时的完全的通过映射文件,而不是绕过初始加载时间(没有访问意味着什么需要加载)。

    But, for infrequent lookups, or lookups that are not typical in the same region of the input data, this is about as efficient as it can possibly get (doing only minimum required parsing and O(log n) binary searching), all the while completely bypassing the initial load time by mapping the file instead (no access means nothing needs to be loaded).

    <大骨节病> 住在Coliru (包括测试数据)

    #define NDEBUG
    #undef DEBUG
    #include <boost/iostreams/device/mapped_file.hpp>
    #include <boost/utility/string_ref.hpp>
    #include <boost/optional.hpp>
    #include <boost/spirit/include/qi.hpp>
    #include <thread>
    #include <iomanip>
    
    namespace io = boost::iostreams;
    namespace qi = boost::spirit::qi;
    
    template <typename Key, typename Value> 
    struct text_multi_lookup {
        text_multi_lookup(char const* begin, char const* end)
            : _map_begin(begin), 
              _map_end(end)
        {
        }
    
      private:
        friend struct iterator;
        enum : char { nl = '\n' };
    
        using rawit = char const*;
        rawit _map_begin, _map_end;
    
        rawit start_of_line(rawit it) const {
            while (it > _map_begin) if (*--it == nl) return it+1;
            assert(it == _map_begin);
            return it;
        }
    
        rawit end_of_line(rawit it) const {
            while (it < _map_end) if (*it++ == nl) return it;
            assert(it == _map_end);
            return it;
        }
    
      public:
        struct value_type final {
            rawit beg, end;
            Key   key;
            Value value;
    
            boost::string_ref str() const { return { beg, size_t(end-beg) }; }
        };
    
        struct iterator : boost::iterator_facade<iterator, boost::string_ref, boost::bidirectional_traversal_tag, value_type> {
    
            iterator(text_multi_lookup const& d, rawit it) : _region(&d), _data { it, nullptr, Key{}, Value{} } { 
                assert(_data.beg == _region->start_of_line(_data.beg));
            }
    
          private:
            friend text_multi_lookup;
    
            text_multi_lookup const* _region;
            value_type mutable _data;
    
            void ensure_parsed() const {
                if (!_data.end) 
                {
                    assert(_data.beg == _region->start_of_line(_data.beg));
                    auto b = _data.beg;
                    _data.end = _region->end_of_line(_data.beg);
    
                    if (!qi::phrase_parse(
                                b, _data.end,
                                qi::auto_ >> qi::auto_ >> qi::eoi,
                                qi::space,
                                _data.key, _data.value)) 
                    {
                        std::cerr << "Problem in: " << std::string(_data.beg, _data.end) 
                                  << "at:         " << std::setw(_data.end-_data.beg) << std::right << std::string(_data.beg,_data.end);
                        assert(false);
                    }
                }
            }
    
            static iterator mid_point(iterator const& a, iterator const& b) {
                assert(a._region == b._region);
                return { *a._region, a._region->start_of_line(a._data.beg + (b._data.beg -a._data.beg)/2) };
            }
    
          public:
            value_type const& dereference() const {
                ensure_parsed();
                return _data;
            }
    
            bool equal(iterator const& o) const {
                return (_region == o._region) && (_data.beg == o._data.beg);
            }
    
            void increment() {
                _data = { _region->end_of_line(_data.beg), nullptr, Key{}, Value{} };
                assert(_data.beg == _region->start_of_line(_data.beg));
            }
        };
    
        using const_iterator = iterator;
    
        const_iterator begin()  const { return { *this, _map_begin }; }
        const_iterator end()    const { return { *this, _map_end   }; }
        const_iterator cbegin() const { return { *this, _map_begin }; }
        const_iterator cend()   const { return { *this, _map_end   }; }
    
        template <typename CompatibleKey>
        const_iterator lower_bound(CompatibleKey const& key) const {
            auto f(begin()), l(end());
            while (f!=l) {
                auto m = iterator::mid_point(f,l);
    
                if (m->key < key) {
                    f = m;
                    ++f;
                }
                else {
                    l = m;
                }
            }
            return f;
        }
    
        template <typename CompatibleKey>
        const_iterator upper_bound(CompatibleKey const& key) const {
            return upper_bound(key, begin());
        }
    
      private:
        template <typename CompatibleKey>
        const_iterator upper_bound(CompatibleKey const& key, const_iterator f) const {
            auto l(end());
            while (f!=l) {
                auto m = iterator::mid_point(f,l);
    
                if (key < m->key) {
                    l = m;
                }
                else {
                    f = m;
                    ++f;
                }
            }
            return f;
        }
    
      public:
        template <typename CompatibleKey>
        std::pair<const_iterator, const_iterator> equal_range(CompatibleKey const& key) const {
            auto lb = lower_bound(key);
            return { lb, upper_bound(key, lb) };
        }
    
    };
    
    #include <iostream>
    
    int main() {
        io::mapped_file_source map("input.txt");
        text_multi_lookup<double, unsigned int> tml(map.data(), map.data() + map.size());
    
        auto const e = tml.end();
    
        for(auto&& line : tml)
        {
            std::cout << line.str();
    
            auto er = tml.equal_range(line.key);
    
            if (er.first  != e) std::cout << " lower: " << er.first->str();
            if (er.second != e) std::cout << " upper: " << er.second->str();
        }
    }
    

    有关好奇:这里的拆卸。注意:所有的算法的东西是如何内联右转进入 HTTP ://paste.ubuntu.com/9946135/

    For the curious: here's the disassembly. Note how all the algorithmic stuff is inlined right into main: http://paste.ubuntu.com/9946135/

    这篇关于使用boost ::输入输出流:: mapped_file_source用的std :: multimap中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

  • 查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆