Windows vs Linux - C++ Thread Pool Memory Usage

Problem Description

I have been looking at the memory usage of some C++ REST API frameworks in Windows and Linux (Debian). In particular, I have looked at these two frameworks: cpprestsdk and cpp-httplib. In both, a thread pool is created and used to service requests.

I took the thread pool implementation from cpp-httplib and put it in a minimal working example below, to show the memory usage that I am observing on Windows and Linux.

#include <cassert>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <list>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

using namespace std;

// TaskQueue and ThreadPool taken from https://github.com/yhirose/cpp-httplib
class TaskQueue {
public:
    TaskQueue() = default;
    virtual ~TaskQueue() = default;

    virtual void enqueue(std::function<void()> fn) = 0;
    virtual void shutdown() = 0;

    virtual void on_idle() {};
};

class ThreadPool : public TaskQueue {
public:
    explicit ThreadPool(size_t n) : shutdown_(false) {
        while (n) {
            threads_.emplace_back(worker(*this));
            cout << "Thread number " << threads_.size() + 1 << " has ID " << threads_.back().get_id() << endl;
            n--;
        }
    }

    ThreadPool(const ThreadPool&) = delete;
    ~ThreadPool() override = default;

    void enqueue(std::function<void()> fn) override {
        std::unique_lock<std::mutex> lock(mutex_);
        jobs_.push_back(fn);
        cond_.notify_one();
    }

    void shutdown() override {
        // Stop all worker threads...
        {
            std::unique_lock<std::mutex> lock(mutex_);
            shutdown_ = true;
        }

        cond_.notify_all();

        // Join...
        for (auto& t : threads_) {
            t.join();
        }
    }

private:
    struct worker {
        explicit worker(ThreadPool& pool) : pool_(pool) {}

        void operator()() {
            for (;;) {
                std::function<void()> fn;
                {
                    std::unique_lock<std::mutex> lock(pool_.mutex_);

                    pool_.cond_.wait(
                        lock, [&] { return !pool_.jobs_.empty() || pool_.shutdown_; });

                    if (pool_.shutdown_ && pool_.jobs_.empty()) { break; }

                    fn = pool_.jobs_.front();
                    pool_.jobs_.pop_front();
                }

                assert(true == static_cast<bool>(fn));
                fn();
            }
        }

        ThreadPool& pool_;
    };
    friend struct worker;

    std::vector<std::thread> threads_;
    std::list<std::function<void()>> jobs_;

    bool shutdown_;

    std::condition_variable cond_;
    std::mutex mutex_;
};

// MWE
class ContainerWrapper {
public:
    ~ContainerWrapper() {
        cout << "Destructor: data map is of size " << data.size() << endl;
    }

    map<pair<string, string>, double> data;
};

void handle_post() {
    
    cout << "Start adding data, thread ID: " << std::this_thread::get_id() << endl;

    ContainerWrapper cw;
    for (size_t i = 0; i < 5000; ++i) {
        string date = "2020-08-11";
        string id = "xxxxx_" + std::to_string(i);
        double value = 1.5;
        cw.data[make_pair(date, id)] = value;
    }

    cout << "Data map is now of size " << cw.data.size() << endl;

    unsigned pause = 3;
    cout << "Sleep for " << pause << " seconds." << endl;
    std::this_thread::sleep_for(std::chrono::seconds(pause));
}

int main(int argc, char* argv[]) {

    cout << "ID of main thread: " << std::this_thread::get_id() << endl;

    std::unique_ptr<TaskQueue> task_queue(new ThreadPool(40));

    for (size_t i = 0; i < 50; ++i) {
        
        cout << "Add task number: " << i + 1 << endl;
        task_queue->enqueue([]() { handle_post(); });

        // Sleep enough time for the task to finish.
        std::this_thread::sleep_for(std::chrono::seconds(5));
    }

    task_queue->shutdown();

    return 0;
}

When I run this MWE and look at the memory consumption on Windows vs Linux, I get the graph below. On Windows, I used perfmon to get the Private Bytes value. On Linux, I used docker stats --no-stream --format "{{.MemUsage}}" to log the container's memory usage; this was in line with the res value for the process from top running inside the container.

It appears from the graph that when a thread allocates memory for the map variable in the handle_post function on Windows, the memory is given back when the function exits, before the next call to the function. This is the type of behaviour I was naively expecting. I have no experience of how the OS deals with memory allocated by a function executing on a thread that stays alive, i.e. as here in a thread pool. On Linux, it looks like the memory usage keeps growing and memory is not given back when the function exits. Once all 40 threads have been used, with 10 more tasks left to process, the memory usage appears to stop growing.

Can somebody give a high-level view of what is happening here on Linux from a memory-management point of view, or even some pointers on where to look for background info on this specific topic?

Edit 1: I have edited the graph below to show the output value of rss from running ps -p <pid> -h -o etimes,pid,rss,vsz every second in the Linux container, where <pid> is the id of the process being tested. It is in reasonable agreement with the output of docker stats --no-stream --format "{{.MemUsage}}".

Edit 2: Based on a comment below regarding STL allocators, I removed the map from the MWE by replacing the handle_post function with the following and adding the includes #include <cstdlib> and #include <cstring>. Now, the handle_post function just allocates and sets memory for 500K ints, which is approximately 2 MiB.

void handle_post() {
    
    size_t chunk = 500000 * sizeof(int);
    if (int* p = (int*)malloc(chunk)) {

        memset(p, 1, chunk);
        cout << "Allocated and used " << chunk << " bytes, thread ID: " << this_thread::get_id() << endl;
        cout << "Memory address: " << p << endl;

        unsigned pause = 3;
        cout << "Sleep for " << pause << " seconds." << endl;
        this_thread::sleep_for(chrono::seconds(pause));

        free(p);
    }
}

I get the same behaviour here. I reduced the number of threads to 8 and the number of tasks to 10 in the example. The graph below shows the results.

Edit 3: I have added the results from running on a Linux CentOS machine. They broadly agree with the results from the Debian docker image.

Edit 4: Based on another comment below, I ran the example under valgrind's massif tool. The massif command line parameters are in the images below. I ran it with --pages-as-heap=yes, second image below, and without this flag, first image below. The first image would suggest that ~2MiB memory is allocated to the (shared) heap as the handle_post function is executed on a thread and then freed as the function exits. This is what I would expect and what I observe on Windows. I am not sure how to interpret the graph with --pages-as-heap=yes yet, i.e. the second image.

I can't reconcile the output of massif in the first image with the value of rss from the ps command shown in the graphs above. If I run the Docker image and limit the container memory to 12MB using docker run --rm -it --privileged --memory="12m" --memory-swap="12m" --name=mwe_test cpp_testing:1.0, the container runs out of memory on the 7th allocation and is killed by the OS. I get Killed in the output, and when I look at dmesg, I see Killed process 25709 (cpp_testing) total-vm:529960kB, anon-rss:10268kB, file-rss:2904kB, shmem-rss:0kB. This would suggest that the rss value from ps accurately reflects the (heap) memory actually being used by the process, whereas the massif tool calculates what the usage should be based on the malloc/new and free/delete calls. This is just my basic assumption from this test. My question still stands: why is the heap memory not being freed, or why does it appear not to be freed, when the handle_post function exits?

Edit 5: I have added below a graph of the memory usage as the number of threads in the thread pool increases from 1 to 4. The pattern continues as the thread count increases up to 10, so I have not included 5 through 10. Note that I added a 5-second pause at the start of main, which accounts for the initial flat line in the graph for the first ~5 seconds. It appears that, regardless of thread count, memory is released after the first task is processed, but is not released (kept for reuse?) after tasks 2 through 10. This may suggest that some memory allocation parameter is tuned during the execution of task 1 (just thinking out loud!).

Edit 6: Based on the suggestion from the detailed answer below, I set the environment variable MALLOC_ARENA_MAX to 1 and 2 before running the example. This gives the output in the following graph. This is as expected based on the explanation of the effect of this variable given in the answer.

Solution

Many modern allocators, including the one in glibc 2.17 that you are using, use multiple arenas (structures that each track free memory regions) in order to avoid contention between threads that want to allocate at the same time.

Memory freed back to one arena isn't available to be allocated by another arena (unless some type of cross-arena transfer is triggered).

By default, glibc will allocate new arenas every time a new thread makes an allocation, until a predefined limit is hit (which defaults to 8 * number of CPUs) as you can see by examining the code.
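
As a quick illustration of this (a hypothetical demo, not part of the MWE above): allocating a same-sized block from several freshly created threads and printing the returned pointers will typically show widely separated addresses, because each thread's first allocation is served from a different arena, often a separate mmap'd region.

#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) {
        threads.emplace_back([i] {
            // Each new thread's first malloc is typically served from its
            // own arena, so the returned addresses tend to be far apart.
            void* p = std::malloc(1 << 20);  // 1 MiB
            std::printf("thread %d got %p\n", i, p);
            std::free(p);
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    return 0;
}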

One consequence of this is that memory allocated and then freed on one thread may not be available to other threads, since they use separate arenas, even if that thread isn't doing any useful work.

You can try setting the glibc malloc tunable glibc.malloc.arena_max to 1 in order to force all threads to the same arena and see if it changes the behavior you were observing.
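
If setting the tunable through the environment is inconvenient, the same limit can also be set programmatically. A minimal sketch, assuming glibc (mallopt and M_ARENA_MAX are glibc-specific, declared in <malloc.h>); note that the glibc.malloc.arena_max spelling is a GLIBC_TUNABLES name from newer glibc releases, while on glibc 2.17 the equivalent controls are the MALLOC_ARENA_MAX environment variable (as used in Edit 6 above) or this call:

#include <malloc.h>  // glibc-specific: mallopt, M_ARENA_MAX

int main() {
    // Cap the number of malloc arenas at 1 so all threads share one arena.
    // Roughly equivalent to running the process with MALLOC_ARENA_MAX=1.
    // Call this early in main, before the worker threads start allocating.
    mallopt(M_ARENA_MAX, 1);

    // ... create the thread pool and enqueue tasks as in the MWE ...
    return 0;
}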

Note that this has everything to do with the userspace allocator (in libc) and nothing to do with the OS allocation of memory: the OS is never informed that the memory has been freed. Even if you force a single arena, it doesn't mean that the userspace allocator will decide to inform the OS: it may simply keep the memory around to satisfy a future request (there are tunables to adjust this behavior also).
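
For completeness, glibc can also be asked explicitly to hand free heap memory back to the OS. A minimal sketch, assuming glibc (malloc_trim is glibc-specific, declared in <malloc.h>); for example, it could be called from the pool's on_idle hook or after each task:

#include <malloc.h>  // glibc-specific: malloc_trim

void release_free_heap_to_os() {
    // Ask glibc to return unused memory at the top of the heap (and, in
    // newer glibc versions, free pages inside arenas) to the kernel.
    // Returns 1 if any memory was released, 0 otherwise.
    malloc_trim(0);
}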

However, in your test using a single arena should be enough to prevent the constantly increasing memory footprint since the memory is freed before the next thread starts, and so we expect it to be reused by the next task, which starts on a different thread.

Finally, it is worth pointing out that what happens is highly dependent on exactly how threads are notified by the condition variable: presumably Linux uses a FIFO behavior, where the most recently queued (waiting) thread will be the last to be notified. This causes you to cycle through all the threads as you add tasks, causing many arenas to be created. A more efficient pattern (for a variety of reasons) is a LIFO policy: use the most recently enqueued thread for the next job. This would cause the same thread to be repeatedly reused in your test and "solve" the problem.
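
A minimal sketch of such a LIFO wakeup policy (hypothetical, not how cpp-httplib works; shutdown handling omitted): instead of all workers waiting on one shared condition variable, each worker parks on its own slot, idle slots are kept on a stack, and enqueue hands the job to the most recently parked thread.

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <stack>

struct WorkerSlot {
    std::condition_variable cv;
    std::function<void()> job;  // empty until a job is handed over
};

class LifoDispatcher {
public:
    void enqueue(std::function<void()> fn) {
        std::unique_lock<std::mutex> lock(mutex_);
        if (!idle_.empty()) {
            WorkerSlot* slot = idle_.top();  // most recently parked worker
            idle_.pop();
            slot->job = std::move(fn);
            slot->cv.notify_one();
        } else {
            backlog_.push_back(std::move(fn));  // no idle worker: queue the job
        }
    }

    // Called by each worker thread; blocks until a job is handed over.
    std::function<void()> wait_for_job(WorkerSlot& slot) {
        std::unique_lock<std::mutex> lock(mutex_);
        if (!backlog_.empty()) {  // drain queued jobs before parking
            std::function<void()> fn = std::move(backlog_.front());
            backlog_.pop_front();
            return fn;
        }
        idle_.push(&slot);  // park LIFO: last in, first woken
        slot.cv.wait(lock, [&] { return static_cast<bool>(slot.job); });
        std::function<void()> fn = std::move(slot.job);
        slot.job = nullptr;
        return fn;
    }

private:
    std::mutex mutex_;
    std::stack<WorkerSlot*> idle_;
    std::deque<std::function<void()>> backlog_;
};

Each worker would loop calling wait_for_job on its own WorkerSlot and running the returned function. Because the stack always wakes the most recently parked thread, a lightly loaded pool keeps reusing the same hot thread, and hence the same arena.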

Final note: many allocators, but not the one in the older version of glibc that you are using, also implement a per-thread cache, which allows the allocation fast path to proceed without any atomic operations. This can produce a similar effect to the use of multiple arenas, one that keeps scaling with the number of threads.
