Reduce memory usage of a line of code that uses numpy
Problem description
I am using the Python library https://github.com/ficusss/PyGMNormalize for normalizing my datasets (scRNAseq), and the last line in the library's file `utils.py` (https://github.com/ficusss/PyGMNormalize/blob/master/pygmnormalize/utils.py) uses too much memory:

np.percentile(matrix[np.any(matrix > 0, axis=1)], p, axis=0)
Is there a good way of rewriting this line of code to improve the memory usage? I have 200 GB of RAM accessible on the cluster, and yet with a `matrix` of roughly 20 GB this line fails, but I believe there should be a way of making it work.
If all elements of `matrix` are >= 0, then you can do:

np.percentile(matrix[np.any(matrix, axis=1)], p, axis=0)

This uses the fact that any float or integer other than 0 is interpreted as True when viewed as a boolean (which `np.any` does internally). That saves you from building that big boolean matrix `matrix > 0` separately.
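As a small sanity check on toy data (the array below is made up for illustration), both index expressions select the same rows, but the second one skips materializing the intermediate `matrix > 0` boolean array:

```python
import numpy as np

rng = np.random.default_rng(0)
matrix = rng.integers(0, 3, size=(6, 4)).astype(float)  # non-negative entries
matrix[0] = 0.0  # force at least one all-zero row

rows_explicit = matrix[np.any(matrix > 0, axis=1)]  # builds a full boolean matrix first
rows_implicit = matrix[np.any(matrix, axis=1)]      # treats nonzero entries as True directly

print(np.array_equal(rows_explicit, rows_implicit))  # → True
```

Note this shortcut is only safe because all entries are non-negative; with mixed signs, `np.any(matrix, axis=1)` would also keep rows whose only nonzero entries are negative.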
Since you're boolean indexing in `matrix[...]`, you're creating a temporary copy that you don't really care about if it gets overwritten during the `percentile` computation. Thus you can pass `overwrite_input=True` to save even more memory:
mat = matrix.copy()
perc = np.percentile(matrix[np.any(matrix, axis=1)], p, axis=0, overwrite_input=True)
np.array_equal(mat, matrix)  # is `matrix` still the same?
True
Finally, depending on the rest of your architecture, I'd recommend looking into making `matrix` some flavor of `scipy.sparse`, which should significantly reduce your memory usage again (although with some drawbacks depending on the type you use).