Does a Cython container not release its memory?

Problem description

When I run the following code, I expect that once foo() has been executed, the memory used by it (basically to create m) would be released. However, that is not the case. To release this memory I need to restart the IPython console.

%%cython
# distutils: language = c++

import numpy as np
from libcpp.map cimport map as cpp_map

cdef foo():
    cdef:
        cpp_map[int,int]    m
        int i
    for i in range(50000000):
        m[i] = i

foo()

It would be great if someone could tell me why this is the case, and also how to release this memory without restarting the shell. Thanks in advance.

Solution

The effects you are seeing are more or less implementation details of your memory allocator (possibly glibc's default allocator). glibc's memory allocator works as follows:

  • requests for small amounts of memory are satisfied from arenas, which grow (and whose number grows) as needed.
  • requests for large amounts of memory are taken directly from the OS, and are returned directly to the OS as soon as they are freed.

One can tweak when the memory from those arenas is released by using mallopt, but normally an internal heuristic decides when, or whether, the memory should be returned to the OS, which I must confess is kind of black magic to me.
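As a rough illustration of the mallopt route, here is a minimal C++ sketch that assumes glibc. The helper name set_trim_threshold is made up for this example; M_TRIM_THRESHOLD is the glibc parameter that controls how much free memory may sit at the top of the heap before it is handed back to the OS. Lowering it only helps when the freed memory actually ends up at the top of the heap, so it is a knob to experiment with rather than a guaranteed fix.

```cpp
#include <malloc.h>  // glibc-specific: mallopt, M_TRIM_THRESHOLD

// Hypothetical helper (not from the original answer): ask glibc to give
// memory back once more than `bytes` are free at the top of the heap.
// mallopt returns 1 on success and 0 on error.
bool set_trim_threshold(int bytes) {
    return mallopt(M_TRIM_THRESHOLD, bytes) == 1;
}
```

A call such as set_trim_threshold(64 * 1024) would be made once, early in the program, before the large allocations happen.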

The problem with std::map (and the situation is similar for std::unordered_map) is that it doesn't consist of one big chunk of memory, which would be returned to the OS immediately, but of a lot of small nodes (libstdc++ implements map as a red-black tree). So they all come from those arenas, and the heuristic decides not to return the memory to the OS.

As we are using glibc's allocator, one could use the non-standard function malloc_trim to free the memory manually:

%%cython

cdef extern from "malloc.h" nogil:
     int malloc_trim(size_t pad)

def return_memory_to_OS():
    malloc_trim(0)

and now just call return_memory_to_OS() after every usage of foo.
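For readers who want to try the same thing outside of Cython, the idea can be sketched in plain C++ as follows. This assumes glibc, and the helper name fill_map_then_trim is invented for this example.

```cpp
#include <malloc.h>  // glibc-specific: malloc_trim
#include <map>

// Hypothetical helper: build and destroy a std::map, then ask glibc to
// return free heap memory to the OS. Returns the result of malloc_trim:
// 1 if some memory was released, 0 if nothing could be released.
int fill_map_then_trim(int n) {
    {
        std::map<int, int> m;
        for (int i = 0; i < n; ++i) {
            m[i] = i;
        }
    }  // m is destroyed here; its nodes go back to the allocator, not the OS
    return malloc_trim(0);
}
```

Watching the resident memory of the process before and after the malloc_trim call (for example in top) should show the effect for large n.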


The above solution is quick and dirty, and it is not portable. What you want is a custom allocator that releases the memory back to the OS as soon as it is no longer used. That is a lot of work, but luckily we already have such an allocator at hand: CPython's pymalloc, which has returned memory to the OS since Python 2.5 (even if that sometimes means trouble). However, we should also point out a big deficiency of pymalloc: it is not thread-safe, so it can be used only in code that holds the GIL!

Using the pymalloc allocator not only has the advantage of returning memory to the OS; because pymalloc is 8-byte aligned while glibc's allocator is 32-byte aligned, the resulting memory consumption will also be smaller: the nodes of map[int,int] are 40 bytes each, which costs only 40.5 bytes with pymalloc (including overhead), while glibc needs no less than 64 bytes.
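The 40-byte and 40.5-byte figures can be reproduced with a little arithmetic. The sketch below rests on assumptions that are not stated in the answer: a 64-bit platform, a node layout that merely resembles libstdc++'s red-black tree node, and a pymalloc pool of 4096 bytes with a 48-byte header. All names are made up for illustration.

```cpp
#include <cstddef>  // std::size_t
#include <utility>  // std::pair

// Stand-in for a red-black tree node: color, three links, and the payload.
// On a typical 64-bit platform this is 4 (+4 padding) + 3*8 + 8 = 40 bytes.
enum Color { Red, Black };
struct RbNodeLike {
    Color color;
    RbNodeLike* parent;
    RbNodeLike* left;
    RbNodeLike* right;
    std::pair<const int, int> value;
};

// Assumed pymalloc pool layout (an assumption, see above).
constexpr std::size_t kPoolSize = 4096;
constexpr std::size_t kPoolHeader = 48;

// Number of blocks of `block_size` bytes that fit into one pool.
constexpr std::size_t blocks_per_pool(std::size_t block_size) {
    return (kPoolSize - kPoolHeader) / block_size;
}

// Average pool memory used per block, with header and slack included.
constexpr double bytes_per_block(std::size_t block_size) {
    return static_cast<double>(kPoolSize) / blocks_per_pool(block_size);
}
```

Under these assumptions 101 blocks of 40 bytes fit into one pool, so each node costs 4096 / 101, or roughly 40.55 bytes, which is in line with the 40.5 bytes quoted above.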

My implementation of the custom allocator follows Nicolai M. Josuttis' example and implements only the really needed functionality:

%%cython -c=-std=c++11 --cplus

cdef extern from *:
    """
    #include <cstddef>   // std::size_t
    #include <Python.h>  // pymalloc

    template <class T>
    class pymalloc_allocator {
     public:
       // type definitions
       typedef T        value_type;
       typedef T*       pointer;
       typedef std::size_t    size_type;

       template <class U>
       pymalloc_allocator(const pymalloc_allocator<U>&) throw(){};
       pymalloc_allocator() throw() = default;
       pymalloc_allocator(const pymalloc_allocator&) throw() = default;
       ~pymalloc_allocator() throw() = default;

       // rebind allocator to type U
       template <class U>
       struct rebind {
           typedef pymalloc_allocator<U> other;
       };

       pointer allocate (size_type num, const void* = 0) {
           pointer ret = static_cast<pointer>(PyMem_Malloc(num*sizeof(value_type)));
           return ret;
       }

       void deallocate (pointer p, size_type num) {
           PyMem_Free(p);
       }

       // missing: destroy, construct, max_size, address
       //  -
   };

   // missing:
   //  bool operator== , bool operator!= 

    #include <utility>
    typedef pymalloc_allocator<std::pair<const int, int>> PairIntIntAlloc;

    //further helper (not in functional.pxd):
    #include <functional>
    typedef std::less<int> Less;
    """
    cdef cppclass PairIntIntAlloc:
        pass
    cdef cppclass Less:
        pass


from libcpp.map cimport map as cpp_map

def foo():
    cdef:
        cpp_map[int,int, Less, PairIntIntAlloc] m
        int i
    for i in range(50000000):
        m[i] = i

Now the lion's share of the used memory is returned to the OS once foo is done, on any operating system and with any memory allocator!
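The allocator above leaves out operator== and operator!=. For comparison, here is a sketch of what a complete minimal C++11 allocator could look like. It is not the answer's code: std::malloc and std::free stand in for PyMem_Malloc and PyMem_Free so that it compiles without Python.h, and the names minimal_allocator, IntMap and fill_and_count are invented for this example. Note that the allocator's value_type is std::pair<const int, int>, which is the value_type of std::map<int, int>.

```cpp
#include <cstddef>     // std::size_t
#include <cstdlib>     // std::malloc, std::free
#include <functional>  // std::less
#include <map>
#include <new>         // std::bad_alloc
#include <utility>     // std::pair

// Minimal C++11 allocator; std::allocator_traits fills in the rest
// (rebind, construct, destroy, max_size, ...).
template <class T>
struct minimal_allocator {
    typedef T value_type;

    minimal_allocator() noexcept {}
    template <class U>
    minimal_allocator(const minimal_allocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        void* p = std::malloc(n * sizeof(T));  // stand-in for PyMem_Malloc
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept {
        std::free(p);  // stand-in for PyMem_Free
    }
};

// The allocator is stateless, so any two instances are interchangeable.
template <class T, class U>
bool operator==(const minimal_allocator<T>&, const minimal_allocator<U>&) noexcept {
    return true;
}
template <class T, class U>
bool operator!=(const minimal_allocator<T>&, const minimal_allocator<U>&) noexcept {
    return false;
}

typedef std::map<int, int, std::less<int>,
                 minimal_allocator<std::pair<const int, int> > > IntMap;

// Hypothetical helper: fill a map that uses the allocator, return its size.
std::size_t fill_and_count(int n) {
    IntMap m;
    for (int i = 0; i < n; ++i) {
        m[i] = i;
    }
    return m.size();
}
```

Swapping the two stand-in calls back to PyMem_Malloc and PyMem_Free (and including Python.h) should give an allocator equivalent in spirit to the one in the answer, with the same GIL restriction.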


If memory consumption is an issue, one could switch to unordered_map, which needs somewhat less memory. However, at the moment unordered_map.pxd doesn't offer access to all template parameters, so one has to wrap it manually:

%%cython -c=-std=c++11 --cplus

cdef extern from *:
    """
    ....

    //further helper (not in functional.pxd):
    #include <functional>
    ...
    typedef std::hash<int> Hash;
    typedef std::equal_to<int> Equal_to;
    """
    ...
    cdef cppclass Hash:
        pass
    cdef cppclass Equal_to:
        pass

cdef extern from "<unordered_map>" namespace "std" nogil:
    cdef cppclass unordered_map[T, U, HASH=*, PRED=*, ALLOC=*]:
        U& operator[](T&)

N = 5*10**8

def foo_unordered_pymalloc():
    cdef:
        unordered_map[int, int, Hash, Equal_to, PairIntIntAlloc] m
        int i
    for i in range(N):
        m[i] = i


Here are some benchmarks, which are obviously not complete but probably show the direction pretty well (measured with N=3e7 instead of N=5e8):

                                   Time           PeakMemory

map_default                        40.1s             1416Mb
map_default+return_memory          41.8s 
map_pymalloc                       12.8s             1200Mb

unordered_default                   9.8s             1190Mb
unordered_default+return_memory    10.9s
unordered_pymalloc                  5.5s              730Mb

The timings were done via the %timeit magic, and peak memory usage was measured via /usr/bin/time -fpeak_used_memory:%M python script_xxx.py.
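The %M value reported by /usr/bin/time is the maximum resident set size. If one wants to read the same kind of number from inside a C++ test program, a sketch along the following lines might do. It assumes a POSIX system, and the helper name peak_rss is made up for this example.

```cpp
#include <sys/resource.h>  // getrusage, struct rusage (POSIX)

// Hypothetical helper: peak resident set size of the current process so far.
// On Linux ru_maxrss is reported in kilobytes. Returns -1 on failure.
long peak_rss() {
    struct rusage usage = {};
    if (getrusage(RUSAGE_SELF, &usage) != 0) {
        return -1;
    }
    return usage.ru_maxrss;
}
```

Because this is a high-water mark, it never goes down when memory is released; it is only useful for comparing peaks between runs.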

I'm somewhat surprised that pymalloc outperforms the glibc allocator by so much, and also that memory allocation seems to be the bottleneck for the usual map! Maybe this is the price glibc must pay for supporting multi-threading.

unordered_map is faster and maybe needs less memory (although, because of rehashing, the last part could be wrong).
