Does Cython container not release memory?
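Question
When I run the following code, I expect that once foo() has been executed, the memory used by it (basically to create m) would be released. However, that is not the case. To release this memory I need to restart the IPython console.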
%%cython
# distutils: language = c++
import numpy as np
from libcpp.map cimport map as cpp_map

cdef foo():
    cdef:
        cpp_map[int,int] m
        int i
    for i in range(50000000):
        m[i] = i

foo()
It would be great if someone could tell me why this is the case, and also how to release this memory without restarting the shell. Thanks in advance.
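Answer
The effects you are seeing are more or less implementation details of your memory allocator (possibly glibc's default allocator). glibc's memory allocator works as follows:
- Requests for small amounts of memory are satisfied from arenas, which grow, and whose number grows, as needed.
- Requests for large amounts of memory are taken directly from the OS and are returned directly to the OS as soon as they are freed.
One can tweak when the memory from those arenas is released using mallopt, but normally an internal heuristic decides when, or whether, the memory should be returned to the OS, which I must confess is kind of black magic to me.
The problem with std::map (and the situation is similar for std::unordered_map) is that it doesn't consist of one big chunk of memory which would be returned to the OS immediately, but of a lot of small nodes (libstdc++ implements map as a red-black tree). So they all come from those arenas, and the heuristic decides not to return the memory to the OS.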
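To watch this behaviour from the same IPython session, one option is to read the process's resident set size before and after calling foo(). The following is only a sketch: it assumes Linux (it parses /proc/self/status), and the helper name current_rss_kb is hypothetical, not part of the original answer.

```python
def current_rss_kb():
    """Return the resident set size in kB, or None if it cannot be read.

    Assumes a Linux-style /proc/self/status with a 'VmRSS:' line.
    """
    try:
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    # Line looks like: 'VmRSS:     12345 kB'
                    return int(line.split()[1])
    except OSError:
        return None
    return None


if __name__ == "__main__":
    print("RSS now:", current_rss_kb(), "kB")
```

Calling it once before foo() and once after gives a rough idea of how much memory stayed with the process.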
As we are using glibc's allocator, one could use the non-standard function malloc_trim to free the memory manually:
%%cython
cdef extern from "malloc.h" nogil:
    int malloc_trim(size_t pad)

def return_memory_to_OS():
    malloc_trim(0)
and now just call return_memory_to_OS() after every usage of foo.
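For quick experiments without compiling a Cython cell, the same glibc function can probably be reached from plain Python through ctypes. This is only a sketch under the assumption of a Linux system with glibc; the helper name try_malloc_trim is hypothetical, and on other C libraries it simply returns None.

```python
import ctypes
import ctypes.util


def try_malloc_trim(pad=0):
    """Call glibc's malloc_trim(pad) if the C library provides it.

    Returns malloc_trim's result (1 if memory was released, 0 otherwise),
    or None when malloc_trim is not available (for example on non-glibc systems).
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return None
    try:
        libc = ctypes.CDLL(libc_name)
        trim = libc.malloc_trim  # AttributeError if the symbol is missing
    except (OSError, AttributeError):
        return None
    trim.argtypes = [ctypes.c_size_t]
    trim.restype = ctypes.c_int
    return trim(pad)


if __name__ == "__main__":
    print("malloc_trim result:", try_malloc_trim(0))
```

Because the extension module and ctypes share the process's C library, this should act on the same heap as the C++ map, but it is just as non-portable as the Cython version.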
The above solution is quick and dirty, but it is not portable. What you want is a custom allocator which releases memory back to the OS as soon as it is no longer used. That is a lot of work, but luckily we already have such an allocator at hand: CPython's pymalloc. Since Python 2.5 it returns memory to the OS (even if that sometimes means trouble). However, we should also point out a big deficiency of pymalloc: it is not thread-safe, so it can be used only for code that holds the GIL!
Using the pymalloc allocator has not only the advantage of returning the memory to the OS. Because pymalloc is 8-byte aligned while glibc's allocator is 32-byte aligned, the resulting memory consumption will also be smaller (nodes of map[int,int] are 40 bytes, which cost only 40.5 bytes with pymalloc, overhead included, while glibc needs no less than 64 bytes).
My implementation of the custom allocator follows Nicolai M. Josuttis' example and implements only the functionality that is really needed:
%%cython -c=-std=c++11 --cplus
cdef extern from *:
    """
    #include <cstddef>   // std::size_t
    #include <Python.h>  // pymalloc

    template <class T>
    class pymalloc_allocator {
     public:
       // type definitions
       typedef T           value_type;
       typedef T*          pointer;
       typedef std::size_t size_type;

       template <class U>
       pymalloc_allocator(const pymalloc_allocator<U>&) throw(){};
       pymalloc_allocator() throw() = default;
       pymalloc_allocator(const pymalloc_allocator&) throw() = default;
       ~pymalloc_allocator() throw() = default;

       // rebind allocator to type U
       template <class U>
       struct rebind {
           typedef pymalloc_allocator<U> other;
       };

       pointer allocate (size_type num, const void* = 0) {
           pointer ret = static_cast<pointer>(PyMem_Malloc(num*sizeof(value_type)));
           return ret;
       }

       void deallocate (pointer p, size_type num) {
           PyMem_Free(p);
       }

       // missing: destroy, construct, max_size, address
       //  -
    };

    // missing:
    //  bool operator== , bool operator!=

    #include <utility>
    typedef pymalloc_allocator<std::pair<int, int>> PairIntIntAlloc;

    //further helper (not in functional.pxd):
    #include <functional>
    typedef std::less<int> Less;
    """
    cdef cppclass PairIntIntAlloc:
        pass
    cdef cppclass Less:
        pass

from libcpp.map cimport map as cpp_map

def foo():
    cdef:
        cpp_map[int,int, Less, PairIntIntAlloc] m
        int i
    for i in range(50000000):
        m[i] = i
Now the lion's share of the used memory is returned to the OS once foo is done, on any operating system and with any memory allocator!
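As a rough back-of-the-envelope check, the per-node figures quoted in the answer (40.5 bytes per node with pymalloc versus 64 bytes with glibc) can be multiplied by the 50,000,000 entries inserted in foo. The sketch below counts node storage only and ignores every other source of memory use, so the totals are estimates at best; the function name is hypothetical.

```python
def estimated_node_memory_mib(n_nodes, bytes_per_node):
    """Estimate the memory (in MiB) taken by n_nodes map nodes."""
    return n_nodes * bytes_per_node / 2**20


N_NODES = 50_000_000             # loop bound used in foo()
GLIBC_BYTES_PER_NODE = 64        # figure quoted in the answer
PYMALLOC_BYTES_PER_NODE = 40.5   # figure quoted in the answer

glibc_mib = estimated_node_memory_mib(N_NODES, GLIBC_BYTES_PER_NODE)
pymalloc_mib = estimated_node_memory_mib(N_NODES, PYMALLOC_BYTES_PER_NODE)

print(f"glibc:    ~{glibc_mib:.0f} MiB")
print(f"pymalloc: ~{pymalloc_mib:.0f} MiB")
```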
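If memory consumption is an issue, one could switch to unordered_map, which needs somewhat less memory. However, at the moment unordered_map.pxd doesn't offer access to all template parameters, so one has to wrap it manually:
%%cython -c=-std=c++11 --cplus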
cdef extern from *:
    """
    ....

    //further helper (not in functional.pxd):
    #include <functional>
    ...
    typedef std::hash<int> Hash;
    typedef std::equal_to<int> Equal_to;
    """
    ...
    cdef cppclass Hash:
        pass
    cdef cppclass Equal_to:
        pass

cdef extern from "<unordered_map>" namespace "std" nogil:
    cdef cppclass unordered_map[T, U, HASH=*, PRED=*, ALLOC=*]:
        U& operator[](T&)

N = 5*10**8

def foo_unordered_pymalloc():
    cdef:
        unordered_map[int, int, Hash, Equal_to, PairIntIntAlloc] m
        int i
    for i in range(N):
        m[i] = i
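Here are some benchmarks, which are obviously not complete but probably show the direction pretty well (for N=3e7 instead of N=5e8):
                                  Time     PeakMemory
map_default                       40.1s    1416Mb
map_default+return_memory         41.8s
map_pymalloc                      12.8s    1200Mb
unordered_default                  9.8s    1190Mb
unordered_default+return_memory   10.9s
unordered_pymalloc                 5.5s     730Mb
The timings were done via the %timeit magic, and peak memory usage was measured via /usr/bin/time -fpeak_used_memory:%M python script_xxx.py.
I'm somewhat surprised that pymalloc outperforms the glibc allocator by so much, and also that memory allocation seems to be the bottleneck for the usual map! Maybe this is the price glibc must pay for supporting multi-threading. unordered_map is faster and maybe needs less memory (OK, because of the rehashing the last part could be wrong).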
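For readers who prefer to read the peak from inside the script rather than through /usr/bin/time, a comparable number can probably be obtained from the standard resource module. This is a sketch under the assumption of a Unix-like system; the helper name peak_rss_mib is hypothetical, and the value need not match the /usr/bin/time figure exactly.

```python
import sys

try:
    import resource  # only available on Unix-like systems
except ImportError:
    resource = None


def peak_rss_mib():
    """Return the peak resident set size of this process in MiB, or None."""
    if resource is None:
        return None
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 2**20 if sys.platform == "darwin" else 2**10
    return peak / divisor


if __name__ == "__main__":
    print("peak RSS:", peak_rss_mib(), "MiB")
```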