Reading large (~1GB) data file with C++ sometimes throws bad_alloc, even if I have more than 10GB of RAM available


Problem Description

I'm trying to read the data contained in a .dat file with size ~1.1GB. Because I'm doing this on a 16GB RAM machine, I thought it would not be a problem to read the whole file into memory at once and only process it afterwards.

To do this, I employed the slurp function found in this SO answer. The problem is that the code sometimes, but not always, throws a bad_alloc exception. Looking at the task manager I see that there are always at least 10GB of free memory available, so I don't see how memory would be an issue.

Here is the code that reproduces the error:

#include <iostream>
#include <fstream>
#include <sstream>
#include <string>

using namespace std;

int main()
{
    ifstream file;
    file.open("big_file.dat");
    if(!file.is_open())
    {
        cerr << "The file was not found\n";
        return 1;    // bail out instead of reading from an unopened stream
    }

    stringstream sstr;
    sstr << file.rdbuf();
    string text = sstr.str();

    cout << "Successfully read file!\n";
    return 0;
}

What could be causing this problem? And what are the best practices to avoid it?

Answer

The fact that your system has 16GB doesn't mean any program at any time can allocate a given amount of memory. In fact, this might work on a machine that has only 512MB of physical RAM, if enough swap is available, or it might fail on an HPC node with 128GB of RAM – it's totally up to your operating system to decide how much memory is available to you here.

I'd also argue that std::string is never the data type of choice when dealing with a file, possibly binary, that large.

The point here is that there is absolutely no knowing how much memory stringstream tries to allocate. A pretty reasonable algorithm would double the amount of allocated memory every time the internal buffer becomes too small to hold the incoming bytes. On top of that, libc++/libc will probably have their own allocators, adding some allocation overhead here.

Note that stringstream::str() returns a copy of the data contained in the stringstream's internal state, which again leaves you with at least 2.2GB of heap used up for this task.
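A minimal sketch of a slurp that sidesteps both issues, assuming the big_file.dat from the question: query the file size up front, make one allocation of exactly that size, and read straight into it – no growth doubling, no str() copy.

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    // Open at the end so tellg() reports the file size directly
    std::ifstream file("big_file.dat", std::ios::binary | std::ios::ate);
    if(!file.is_open())
    {
        std::cerr << "The file was not found\n";
        return 1;
    }
    const std::streamsize size = file.tellg();
    file.seekg(0, std::ios::beg);

    std::string text(static_cast<std::size_t>(size), '\0');  // single allocation
    if(!file.read(&text[0], size))
    {
        std::cerr << "Read failed\n";
        return 1;
    }
    std::cout << "Successfully read " << text.size() << " bytes\n";
    return 0;
}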

Really, if you need to deal with data from a large binary file as something you can access with the index operator [], look into memory mapping the file; that way you get a pointer to the beginning of the file and can work with it as if it were a plain array in memory, letting your OS take care of the underlying memory/buffer management. It's what OSes are for!

If you didn't know Boost before, it's kind of "the extended standard library for C++" by now, and of course, it has a class abstracting memory mapping a file: mapped_file.
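A minimal sketch using Boost.Iostreams' mapped_file_source (the file name is the question's):

#include <boost/iostreams/device/mapped_file.hpp>
#include <iostream>

int main()
{
    // The OS pages the file in on demand; no gigabyte-sized heap allocation is made
    boost::iostreams::mapped_file_source file("big_file.dat");
    const char* data = file.data();   // pointer to the first byte of the file
    std::size_t size = file.size();   // length of the file in bytes

    // data[0] .. data[size-1] can now be indexed like a plain in-memory array
    std::cout << "Mapped " << size << " bytes\n";
    return 0;
}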

The file I'm reading contains a series of data in ASCII tabular form, i.e. float1,float2\nfloat3,float4\n....

I'm browsing through the various possible solutions proposed on SO to deal with this kind of problem, but I was left wondering about this (to me) peculiar behaviour. What would you recommend in these circumstances?

Depends; I actually think the fastest way of dealing with this (since file IO is much, much slower than in-memory parsing of ASCII) is to parse the file incrementally, directly into an in-memory array of float variables, possibly taking advantage of your OS's prefetching and SMP capabilities; you wouldn't even gain that much speed by spawning separate threads for file reading and float conversion. std::copy, used to read from a std::ifstream into a std::vector<float>, should work fine here.
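One caveat if you try that literally: a plain std::istream_iterator<float> stops at the commas in the float1,float2\n format described above. A minimal sketch that handles the separator directly, with an illustrative reserve() based on the ~90-million-floats estimate further down:

#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    std::ifstream file("big_file.dat");
    std::vector<float> values;
    values.reserve(90000000);   // illustrative: ~1.1GB at ~12 bytes per value

    float a, b;
    char sep;                   // consumes the ',' between the two floats
    while(file >> a >> sep >> b)
    {
        values.push_back(a);
        values.push_back(b);
    }
    std::cout << "Parsed " << values.size() << " floats\n";
    return 0;
}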

I'm still not getting something: you say that file IO is much slower than in-memory parsing, and this I understand (and it's the reason why I wanted to read the whole file at once). Then you say that the best way is to parse the whole file incrementally into an in-memory array of floats. What exactly do you mean by this? Doesn't it mean reading the file line by line, resulting in a large number of file IO operations?

Yes and no: first, of course, you will have more context switches than if you just ordered the whole file to be read at once. But those aren't that expensive – at least they're much less expensive once you realize that most OSes and libcs know quite well how to optimize reads, and will therefore fetch a whole lot of the file at once unless you use extremely randomized read lengths. Also, you don't incur the penalty of trying to allocate a block of RAM at least 1.1GB in size – that calls for some serious page table lookups, which aren't that fast either.

Now, the idea is that your occasional context switch, and the fact that, if you stay single-threaded, there will be times when you don't read the file because you're still busy converting text to float, will still mean less of a performance hit, because most of the time your read will return almost immediately, as your OS/runtime has already prefetched a significant part of your file.

Generally, you seem to me to be worried about all the wrong things: performance seems important to you (is it really that important here? You're using a brain-dead file format for interchanging floats that is bloated, loses precision, and on top of that is slow to parse), yet you'd rather read the whole file in at once and only then start converting it to numbers. Frankly, if performance were at all critical for your application, you would multi-thread/multi-process it, so that string parsing could already happen while data is still being read. Using buffers of a few kilobytes to megabytes, read up to \n boundaries and handed off to a thread that builds the in-memory table of floats, would basically reduce your read+parse time to read+non-measurable, without sacrificing read performance and without needing gigabytes of RAM just to parse a sequential file.
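A minimal sketch of that buffer-exchange scheme, assuming the same file and pair-per-line format as above; the chunk size, queue handling and names are illustrative choices, not prescribed by the answer:

#include <condition_variable>
#include <fstream>
#include <iostream>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

int main()
{
    std::queue<std::string> chunks;   // buffers handed from reader to parser
    std::mutex              m;
    std::condition_variable cv;
    bool                    done = false;
    std::vector<float>      values;

    // Parser thread: converts text to floats while the reader fetches more
    std::thread parser([&]{
        for(;;)
        {
            std::string chunk;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&]{ return !chunks.empty() || done; });
                if(chunks.empty())
                    return;           // reader finished and queue drained
                chunk = std::move(chunks.front());
                chunks.pop();
            }
            std::istringstream in(chunk);
            float a, b;
            char  sep;
            while(in >> a >> sep >> b)
            {
                values.push_back(a);
                values.push_back(b);
            }
        }
    });

    // Reader (main thread): accumulate ~1MB chunks ending on '\n' boundaries
    std::ifstream file("big_file.dat");
    std::string buffer, line;
    const std::size_t chunk_size = 1 << 20;
    while(std::getline(file, line))
    {
        buffer += line;
        buffer += '\n';
        if(buffer.size() >= chunk_size)
        {
            std::lock_guard<std::mutex> lock(m);
            chunks.push(std::move(buffer));
            buffer.clear();
            cv.notify_one();
        }
    }
    {
        std::lock_guard<std::mutex> lock(m);
        if(!buffer.empty())
            chunks.push(std::move(buffer));
        done = true;
    }
    cv.notify_one();
    parser.join();

    std::cout << "Parsed " << values.size() << " floats\n";
    return 0;
}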

By the way, to give you an impression of how bad storing floats in ASCII is:

The typical 32-bit single-precision IEEE 754 floating point number has about 6-9 significant decimal digits. Hence, you will need at least 6 characters to represent one in ASCII, plus one ., typically one exponential divider such as E, on average 2.5 digits of decimal exponent, and on average half a sign character (- or nothing), if your numbers are uniformly chosen from all possible IEEE 754 32-bit floats:

-1.23456E-10

That's 11 characters on average.

Add one , or \n after every number.

Now, each character is 1B, meaning that you blow up your 4B of actual data by a factor of 3 (roughly 12 bytes per value versus 4), while still losing precision.
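For comparison, a minimal sketch of the binary alternative (file name and sample values are illustrative): each float occupies exactly sizeof(float) == 4 bytes, and no precision is lost to decimal rounding.

#include <fstream>
#include <vector>

int main()
{
    std::vector<float> values = { -1.23456e-10f, 3.14159f, 2.71828f };

    // Write the raw 4-byte representations back to back
    std::ofstream out("floats.bin", std::ios::binary);
    out.write(reinterpret_cast<const char*>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(float)));
    return 0;
}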

Now, people always come around telling me that plaintext is more usable because, if in doubt, the user can read it… I've yet to see a user who can skim through 1.1GB of it (by the calculations above, that's around 90 million floating point numbers, or 45 million floating point pairs) and not go insane.
