Faster way to read/write a std::unordered_map from/to a file


Question


I am working with some very large std::unordered_maps (hundreds of millions of entries) and need to save and load them to and from a file. The way I am currently doing this is by iterating through the map and reading/writing each key and value pair one at a time:

std::unordered_map<unsigned long long int, char> map;

void save(){
    std::unordered_map<unsigned long long int, char>::iterator iter;
    FILE *f = fopen("map", "wb");
    for(iter=map.begin(); iter!=map.end(); iter++){
        fwrite(&(iter->first), 8, 1, f);
        fwrite(&(iter->second), 1, 1, f);
    }
    fclose(f);
}

void load(){
    FILE *f = fopen("map", "rb");
    unsigned long long int key;
    char val;
    while(fread(&key, 8, 1, f)){
        fread(&val, 1, 1, f);
        map[key] = val;
    }
    fclose(f);
}


But with around 624 million entries, reading the map from a file took 9 minutes. Writing to a file was faster but still took several minutes. Is there a faster way to do this?
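One incremental improvement worth noting before changing data structures: the loop above pays the overhead of two tiny `fwrite` calls per entry, hundreds of millions of times. A hedged sketch (the `Record` struct and `save_buffered` function are illustrative names, not from the question) that packs the pairs into one contiguous buffer and issues a single large write:

```cpp
#include <cstdio>
#include <unordered_map>
#include <vector>

// Hypothetical buffered variant of the save loop above: copy each
// (key, value) pair into one contiguous vector, then write it with a
// single fwrite instead of two small writes per entry.
struct Record {
    unsigned long long int key;
    char val;
};

void save_buffered(const std::unordered_map<unsigned long long int, char> &m,
                   const char *path) {
    std::vector<Record> buf;
    buf.reserve(m.size());
    for (const auto &kv : m)
        buf.push_back(Record{kv.first, kv.second});
    FILE *f = fopen(path, "wb");
    if (!f) return;
    // Note: sizeof(Record) includes padding; a production on-disk format
    // would pack the two fields explicitly to stay compatible with the
    // 9-bytes-per-entry layout used in the question.
    fwrite(buf.data(), sizeof(Record), buf.size(), f);
    fclose(f);
}
```

This helps with call overhead but does not address the deeper issue the answer below turns to: the layout of the table itself.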

Answer


C++ unordered_map implementations must all use chaining. There are a variety of really good reasons why you might want to do this for a general purpose hash table, which are discussed here.


This has enormous implications for performance. Most importantly, it means that the entries of the hash table are likely to be scattered throughout memory in a way which makes accessing each one an order of magnitude (or so) less efficient than would be the case if they could somehow be accessed serially.


Fortunately, you can build hash tables that, when nearly full, give near-sequential access to adjacent elements. This is done using open addressing.


Since your hash table is not general purpose, you could try this.


Below, I've built a simple hash table container with open addressing and linear probing. It assumes a few things:



  1. Your keys are already somehow randomly distributed. This obviates the need for a hash function (though decent hash functions are fairly simple to build, even if great hash functions are difficult).


  2. You only ever add elements to the hash table; you do not delete them. If this were not the case you'd need to change the used vector into something that could hold three states: USED, UNUSED, and TOMBSTONE, where TOMBSTONE is the state of a deleted element, used to continue a linear search probe or halt a linear insert probe.


  3. You know the size of your hash table ahead of time, so you don't need to resize/rehash it.


  4. You don't need to traverse your elements in any particular order.


Of course, there are probably all kinds of excellent implementations of open addressing hash tables online which solve many of the above issues. However, the simplicity of my table allows me to convey the important point.


The important point is this: my design allows all the hash table's information to be stored in three vectors. That is: the memory is contiguous.


Contiguous memory is fast to allocate, fast to read from, and fast to write to. The effect of this is profound.


Using the same test setup as my previous answer, I get the following times:

Save. Save time = 82.9345 ms
Load. Load time = 115.111 ms


This is a 95% decrease in save time (22x faster) and a 98% decrease in load time (62x faster).

Code:

#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <iostream>
#include <limits>    //for std::numeric_limits
#include <random>
#include <stdexcept> //for std::runtime_error
#include <vector>

const int TEST_TABLE_SIZE = 10000000;



template<class K, class V>
class SimpleHash {
 public:
  int usedslots = 0;

  std::vector<K> keys;
  std::vector<V> vals;
  std::vector<uint8_t> used;

  //size0 should be a prime and about 30% larger than the maximum number needed
  SimpleHash(uint64_t size0){
    vals.resize(size0);
    keys.resize(size0);
    used.resize(size0/8+1,0);
  }

  //If the key values are already uniformly distributed, using a hash gains us
  //nothing
  uint64_t hash(const K key){
    return key;
  }

  bool isUsed(const uint64_t loc){
    const auto used_loc = loc/8;
    const auto used_bit = 1<<(loc%8);
    return used[used_loc]&used_bit;    
  }

  void setUsed(const uint64_t loc){
    const auto used_loc = loc/8;
    const auto used_bit = 1<<(loc%8);
    used[used_loc] |= used_bit;
  }

  void insert(const K key, const V val){
    uint64_t loc = hash(key)%keys.size();

    //Use linear probing. Can create infinite loops if table too full.
    while(isUsed(loc)){ loc = (loc+1)%keys.size(); }

    setUsed(loc);
    keys[loc] = key;
    vals[loc] = val;
    usedslots++; //track occupancy so usedSize() reports a real count
  }

  V& get(const K key) {
    uint64_t loc = hash(key)%keys.size();

    while(true){
      if(!isUsed(loc))
        throw std::runtime_error("Item not present!");
      if(keys[loc]==key)
        return vals[loc];

      loc = (loc+1)%keys.size();
    }
  }

  uint64_t usedSize() const {
    return usedslots;
  }

  uint64_t size() const {
    return keys.size();
  }
};

typedef SimpleHash<uint64_t, char> table_t;

void SaveSimpleHash(const table_t &map){
  std::cout<<"Save. ";
  const auto start = std::chrono::steady_clock::now();
  FILE *f = fopen("/z/map", "wb");
  uint64_t size = map.size();
  fwrite(&size, 8, 1, f);
  fwrite(map.keys.data(), 8, size, f);
  fwrite(map.vals.data(), 1, size, f);
  fwrite(map.used.data(), 1, size/8+1, f);
  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Save time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

table_t LoadSimpleHash(){
  std::cout<<"Load. ";
  const auto start = std::chrono::steady_clock::now();
  FILE *f = fopen("/z/map", "rb");

  uint64_t size;
  fread(&size, 8, 1, f);

  table_t map(size);
  fread(map.keys.data(), 8, size, f);
  fread(map.vals.data(), 1, size, f);
  fread(map.used.data(), 1, size/8+1, f);
  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;

  return map;
}

int main(){
  //Perfectly horrendous way of seeding a PRNG, but we'll do it here for brevity
  auto generator = std::mt19937(12345); //Combination of my luggage
  //Generate values within the specified closed intervals
  auto key_rand  = std::bind(std::uniform_int_distribution<uint64_t>(0,std::numeric_limits<uint64_t>::max()), generator);
  auto val_rand  = std::bind(std::uniform_int_distribution<int>(std::numeric_limits<char>::lowest(),std::numeric_limits<char>::max()), generator);

  table_t map(1.3*TEST_TABLE_SIZE);
  std::cout<<"Created table of size "<<map.size()<<std::endl;

  std::cout<<"Generating test data..."<<std::endl;
  for(int i=0;i<TEST_TABLE_SIZE;i++)
    map.insert(key_rand(),(char)val_rand()); //Low chance of collisions, so we get quite close to the desired size

  map.insert(23,42);
  assert(map.get(23)==42);

  SaveSimpleHash(map);
  auto newmap = LoadSimpleHash();

  //Ensure that the load worked
  for(size_t i=0;i<map.keys.size();i++)
    assert(map.keys.at(i)==newmap.keys.at(i));
  for(size_t i=0;i<map.vals.size();i++)
    assert(map.vals.at(i)==newmap.vals.at(i));
  for(size_t i=0;i<map.used.size();i++)
    assert(map.used.at(i)==newmap.used.at(i));
}
