Why does GCC's ifstream >> double allocate so much memory?

Question

I need to read a series of numbers from a space-separated human-readable file and do some math, but I've run into some truly bizarre memory behavior just reading the file.

If I read the numbers and immediately discard them...

#include <fstream>

int main(int, char**) {
    std::ifstream ww15mgh("ww15mgh.grd");
    double value;
    while (ww15mgh >> value);
    return 0;
}

My program allocates 59MB of memory according to valgrind, scaling linearly with respect to the size of the file:

$ g++ stackoverflow.cpp
$ valgrind --tool=memcheck --leak-check=yes ./a.out 2>&1 | grep total
==523661==   total heap usage: 1,038,970 allocs, 1,038,970 frees, 59,302,487 bytes allocated

But, if I use ifstream >> string instead and then use sscanf to parse the string, my memory usage looks a lot more sane:

#include <fstream>
#include <string>
#include <cstdio>

int main(int, char**) {
    std::ifstream ww15mgh("ww15mgh.grd");
    double value;
    std::string text;
    while (ww15mgh >> text)
        std::sscanf(text.c_str(), "%lf", &value);
    return 0;
}

$ g++ stackoverflow2.cpp
$ valgrind --tool=memcheck --leak-check=yes ./a.out 2>&1 | grep total
==534531==   total heap usage: 3 allocs, 3 frees, 81,368 bytes allocated

To rule out the IO buffer as the issue, I tried both ww15mgh.rdbuf()->pubsetbuf(0, 0); (which makes the program take ages and still perform 59MB worth of allocations) and pubsetbuf with an enormous stack-allocated buffer (still 59MB). The behavior reproduces when compiled with either gcc 10.2.0 or clang 11.0.1, using /usr/lib/libstdc++.so.6 from gcc-libs 10.2.0 and /usr/lib/libc.so.6 from glibc 2.32. The system locale is set to en_US.UTF-8, but this also reproduces if I set the environment variable LC_ALL=C.

The ARM CI environment where I first noticed the problem is cross-compiled on Ubuntu Focal using GCC 9.3.0, libstdc++6 10.2.0 and libc 2.31.

Following advice in the comments, I tried LLVM's libc++ and got perfectly sane behavior with the original program:

$ clang++ -std=c++14 -stdlib=libc++ -I/usr/include/c++/v1 stackoverflow.cpp
$ valgrind --tool=memcheck --leak-check=yes ./a.out 2>&1 | grep total
==700627==   total heap usage: 3 allocs, 3 frees, 8,664 bytes allocated

So, this behavior seems to be unique to GCC's implementation of fstream. Is there something I could do differently in constructing or using the ifstream that would avoid allocating tons of heap memory when compiled in a GNU environment? Is this a bug in their <fstream>?

As discovered in the comments discussion, the actual memory footprint of the program is perfectly sane (84 KB). It is just allocating and freeing the same small bit of memory hundreds of thousands of times, which creates a problem when using custom allocators like ASAN that avoid re-using heap space. I posted a follow-up question asking how to cope with this kind of problem at the "ASAN" level.

A gitlab project that reproduces the issue in its CI pipeline was generously contributed by Stack Overflow user @KamilCuk.

Recommended answer

It really doesn't. The number 59,302,487 shown by valgrind is the sum of all allocations, and does not represent the actual memory consumption of the program.

It turns out that the libstdc++ implementation of the relevant operator>> creates a temporary std::string for scratch space, and reserves 32 bytes for it. This is then deallocated immediately after being used. See num_get::do_get. With overhead, this perhaps actually allocates 56 bytes or so, which multiplied by about 1 million repetitions does mean, in a sense, that a total of 59 megabytes were allocated, and of course this is why that number scales linearly with the number of inputs. But it was the same 56 bytes being allocated and freed over and over again. This is perfectly innocent behavior by libstdc++ and isn't a leak or excessive memory consumption.
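To make the distinction between "total bytes allocated" and "memory in use at any one time" concrete, here is a minimal sketch. It is not libstdc++'s code: `HeapStats`, `CountingAllocator`, `CountedString` and `simulate_scratch_strings` are hypothetical names invented for this illustration, and the loop only imitates the pattern described above (a temporary string that reserves 32 bytes and is destroyed right away). For reference, the question's own figures work out to roughly 57 bytes per allocation (59,302,487 / 1,038,970).

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical bookkeeping for this sketch only.
struct HeapStats {
    std::size_t allocs = 0;       // number of allocate() calls
    std::size_t frees = 0;        // number of deallocate() calls
    std::size_t total_bytes = 0;  // sum of every request (the kind of figure valgrind's summary line shows)
    std::size_t live_bytes = 0;   // bytes currently outstanding
    std::size_t peak_bytes = 0;   // high-water mark of live_bytes
};

inline HeapStats& stats() {
    static HeapStats s;
    return s;
}

// Minimal allocator that forwards to std::allocator and records each request.
template <class T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        const std::size_t bytes = n * sizeof(T);
        HeapStats& s = stats();
        ++s.allocs;
        s.total_bytes += bytes;
        s.live_bytes += bytes;
        if (s.live_bytes > s.peak_bytes) s.peak_bytes = s.live_bytes;
        return std::allocator<T>().allocate(n);
    }

    void deallocate(T* p, std::size_t n) noexcept {
        const std::size_t bytes = n * sizeof(T);
        HeapStats& s = stats();
        assert(s.live_bytes >= bytes);
        ++s.frees;
        s.live_bytes -= bytes;
        std::allocator<T>().deallocate(p, n);
    }
};

template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return false; }

using CountedString =
    std::basic_string<char, std::char_traits<char>, CountingAllocator<char>>;

// Imitates the described pattern: one short-lived scratch string per parsed value.
HeapStats simulate_scratch_strings(std::size_t iterations) {
    stats() = HeapStats{};
    for (std::size_t i = 0; i < iterations; ++i) {
        CountedString scratch;
        scratch.reserve(32);  // exceeds the small-string buffer, so this hits the allocator
    }                         // scratch is destroyed here and its buffer is released
    return stats();
}
```

Calling `simulate_scratch_strings(1000000)` from `main` and printing the fields should show `total_bytes` growing linearly with the iteration count while `peak_bytes` stays at the size of a single scratch buffer. The exact per-allocation size depends on the standard library's string implementation, so no specific byte count is claimed here.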

I didn't check the libc++ source, but a good bet would be that it uses scratch space on the stack instead of the heap.

As determined in the comments, your real problem is that you are running this under AddressSanitizer, which delays the reuse of freed memory in order to help catch use-after-free errors. I have some thoughts about how to address that (no pun intended) and will post them on the follow-up question, "How do I exclude allocations in a tight loop from ASAN?"
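Separately from any ASAN-level fix, the per-value allocation can be sidestepped at the call site along the lines of the second program in the question, which read each token into a std::string and parsed it itself. The sketch below is a variation on that idea, not something proposed in this answer: it swaps sscanf for std::strtod, and `ParseResult`, `parse_doubles` and `parse_text` are hypothetical names. It assumes whitespace-separated tokens, and note that std::strtod honors the current C locale.

```cpp
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical result type for this sketch.
struct ParseResult {
    std::size_t count = 0;  // how many tokens parsed as numbers
    double sum = 0.0;       // running sum, standing in for "some math"
};

// Reads whitespace-separated tokens and converts each with std::strtod.
// The token string is declared outside the loop so its buffer is reused
// from one iteration to the next instead of being recreated per value.
ParseResult parse_doubles(std::istream& in) {
    ParseResult result;
    std::string token;
    while (in >> token) {
        char* end = nullptr;
        const double value = std::strtod(token.c_str(), &end);
        if (end == token.c_str()) {
            continue;  // no characters consumed: the token is not a number, skip it
        }
        ++result.count;
        result.sum += value;
    }
    return result;
}

// Convenience wrapper for trying the parser on in-memory text.
ParseResult parse_text(const std::string& text) {
    std::istringstream in(text);
    return parse_doubles(in);
}
```

With a file, this would be called as `std::ifstream ww15mgh("ww15mgh.grd"); parse_doubles(ww15mgh);`. Whether it lowers the allocation count on a given toolchain is worth confirming with valgrind, as the question did for its sscanf variant.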
