C++ Binary File I/O Operations Slow Down... How Do DBs Handle Binary Files?


Question


I am trying to make a simple file-based hash table. Here is my insert member function:

private: std::fstream f;  // opened with std::ios::in | std::ios::out | std::ios::binary

public: void insert(const char* this_key, long this_value) {
    char that_key[KEY_SIZE];           // buffer for the stored key (was an uninitialized pointer)
    long this_hash = std::hash<std::string>{}(this_key) % M;
    long that_hash;                    // doubles as the block status flag

    long block = this_hash;
    long offset = block * BLOCK_SIZE;
    while (true) {
        this->f.seekg(offset);
        this->f.read(reinterpret_cast<char*>(&that_hash), sizeof(long));
        if (that_hash > -1) {          // -1 (by default) marks a never-allocated block
            this->f.read(that_key, KEY_SIZE);   // read the stored key, not the whole block
            if (strcmp(this_key, that_key) == 0) {
                this->f.seekp(this->f.tellg()); // key found: overwrite its value in place
                this->f.write(reinterpret_cast<const char*>(&this_value), sizeof(long));
                break;
            } else {                   // occupied by another key: linear probing
                block = (block + 1) % M;
                offset = block * BLOCK_SIZE;
            }
        } else {                       // free block: write status (the hash), key, then value
            this->f.seekp(offset);
            this->f.write(reinterpret_cast<const char*>(&this_hash), sizeof(long));
            this->f.write(this_key, KEY_SIZE);
            this->f.write(reinterpret_cast<const char*>(&this_value), sizeof(long));
            break;
        }
    }
}

Tests with up to 10M key/value pairs over 50,000,017 blocks performed reasonably well (the binary file size was about 3.8GB).

However, a test with 50M keys and 250,000,013 blocks slows down dramatically (the binary file size exceeds 19GB in this case). 1,000 inserts usually take 4~5ms, but occasionally more than 2,000ms. It gets slower and slower, eventually taking 40~150ms (10x~30x slower). I have no idea why...

  • What causes this exceptional binary file I/O slowdown?
  • Are seekg & seekp and other I/O operations affected by file size? (I could not find any references on this question, though...)
  • How do key/value stores and databases avoid this I/O slowdown?
  • How can I solve this problem?

Answer

Data size

Disk drive block sizes are usually a power of 2, so if your data block size is also a power of 2, you can essentially eliminate the case where a data block crosses a disk block boundary.

In your case, a block size of 64 bytes (or 32 bytes if you don't really need to store the hash) would likely perform somewhat better.
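A power-of-two layout can be checked at compile time. The sketch below is illustrative only: the field sizes (`KEY_SIZE = 48`, 8-byte status/value) are assumptions chosen to make the record exactly 64 bytes, not values from the question.

```cpp
#include <cstdint>

// Hypothetical on-disk record: status/hash + key + value = 64 bytes.
constexpr std::size_t KEY_SIZE = 48;   // assumed key capacity (incl. NUL)

struct Block {
    int64_t status_hash;               // -1 = never-allocated block
    char    key[KEY_SIZE];             // NUL-terminated key
    int64_t value;
};

constexpr std::size_t BLOCK_SIZE = sizeof(Block);

// A power-of-two block never straddles a power-of-two disk block boundary.
static_assert(BLOCK_SIZE == 64, "record should be exactly 64 bytes");
static_assert((BLOCK_SIZE & (BLOCK_SIZE - 1)) == 0, "keep it a power of two");
```

With this layout, one `seekg`/`read` of `sizeof(Block)` bytes fetches a whole record, instead of the separate status/key/value reads in the question's code.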

Insertion order

The other thing you could do to improve performance is to perform your insertions in increasing hash order, to reduce the number of times data must be loaded from the disk.

Generally, when data is read from or written to the disk, the OS will read/write a large chunk at a time (maybe 4k), so if your algorithm writes data that is close together in time and in location, you will reduce the number of times data must actually be read from or written to the disk.

Given that you make a lot of insertions, one possibility would be to process insertions in batches of, say, 1,000 or even 10,000 key/value pairs at a time. Essentially, you would accumulate data in memory and sort it, and once you have enough items (or when you are done inserting), write the data in order.
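The batching idea above can be sketched as follows. This is a minimal illustration, not the answer's actual code: the `Batch` type, the flush threshold, and the reuse of `M` from the question are all assumptions, and the actual seek/write per block is left as a comment.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <utility>
#include <vector>

constexpr long M = 250000013;  // bucket count, as in the question's larger test

// Accumulate (bucket, (key, value)) tuples, then flush sorted by bucket so
// the file is touched in ascending offsets instead of random ones.
struct Batch {
    std::vector<std::pair<long, std::pair<std::string, long>>> items;

    void add(const std::string& key, long value) {
        long bucket = static_cast<long>(std::hash<std::string>{}(key) % M);
        items.push_back({bucket, {key, value}});
    }

    // Returns the bucket indices in the order they would be written to disk.
    std::vector<long> flush_order() {
        std::sort(items.begin(), items.end());  // ascending bucket index
        std::vector<long> order;
        for (const auto& it : items) order.push_back(it.first);
        // here you would seekp/write each block, then items.clear()
        return order;
    }
};
```

Because consecutive writes now target nearby offsets, many of them land in the same OS-cached 4k page, which is exactly the locality the paragraph above describes.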

That way, you should be able to reduce disk accesses, which are very slow. This is probably even more important if you are using a traditional hard drive, as moving the head is slow (in which case it might be useful to defragment the drive). Also, be sure that your hard drive has more than enough free space.

In some cases, local caching (in your application) might also be helpful, particularly if you know how your data is used.

File size vs. collisions

When you use a hash, you want to find the sweet spot between file size and collisions. If you have too many collisions, you will waste a lot of time, and at some point performance may degenerate when it becomes hard to find a free slot for almost every insertion.

On the other hand, if your file is very large, you might end up filling your RAM with data that is mostly empty, while still needing to replace it with data from the disk on almost every insertion.

For example, if your data is 20GB and you can keep, say, 2GB in memory, then if inserts are truly random, you might need a real hard drive access 90% of the time.

Configuration

The right options will depend on the OS, and this is beyond the scope of a programming forum. If you want to tune your machine, you should look elsewhere.

Further reading

It might be helpful to read about operating systems (file systems, the cache layer…) and algorithms (external sorting algorithms, B-Trees and other structures) to get a better understanding.

Alternatives

  • More memory
  • A fast SSD
  • Multithreading (e.g. separate input and output threads)
  • Rewriting the algorithm (e.g. reading/writing a whole disk page at a time)
  • A faster CPU / a 64-bit machine
  • An algorithm designed for this kind of workload
  • A database
  • Profiling the code
  • Tuning parameters
