Reduce memory usage of a line of code that uses numpy


Question


I am using the Python library:

https://github.com/ficusss/PyGMNormalize

for normalizing my datasets (scRNAseq), and the last line of the library's file utils.py:

https://github.com/ficusss/PyGMNormalize/blob/master/pygmnormalize/utils.py

uses too much memory:

np.percentile(matrix[np.any(matrix > 0, axis=1)], p, axis=0)

Is there a good way of rewriting this line of code to improve the memory usage? I mean I have 200 GB of RAM accessible on the cluster, and with a matrix of something like 20 GB this line fails to work, but I believe there should be a way of making it work.

Solution

If all elements of matrix are >= 0, then you can do:

np.percentile(matrix[np.any(matrix, axis=1)], p, axis=0)

This uses the fact that any float or integer other than 0 is interpreted as True when viewed as a boolean (which np.any does internally). It saves you from building that big boolean matrix separately.
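A small illustration of the equivalence, on a toy non-negative matrix (the array values here are made up for the example):

```python
import numpy as np

# Toy non-negative matrix: row 0 is all zeros, rows 1 and 2 have nonzeros.
matrix = np.array([[0.0, 0.0, 0.0],
                   [1.5, 0.0, 2.0],
                   [0.0, 3.0, 0.0]])

# Explicit comparison: allocates a full boolean matrix the size of `matrix`.
mask_explicit = np.any(matrix > 0, axis=1)

# Implicit truthiness: np.any treats nonzero entries as True directly,
# with no intermediate boolean matrix.
mask_implicit = np.any(matrix, axis=1)

print(mask_explicit)                                  # [False  True  True]
print(np.array_equal(mask_explicit, mask_implicit))   # True
```

Note this shortcut is only safe when all entries are >= 0; with negative values, `matrix > 0` and "nonzero" are no longer the same condition.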

Since you're boolean indexing in matrix[...], you're already creating a temporary copy, and you don't care if that copy gets overwritten during the percentile computation. Thus you can pass overwrite_input=True to save even more memory.

mat = matrix.copy()
perc = np.percentile(matrix[np.any(matrix, axis=1)], p, axis=0, overwrite_input=True)
np.array_equal(mat, matrix)  # is `matrix` still the same?

True

Finally, depending on the rest of your architecture, I'd recommend looking into making matrix some flavor of scipy.sparse, which should significantly reduce your memory usage again (although with some drawbacks depending on the type you use).
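A rough sketch of the sparse idea, assuming a mostly-zero matrix such as typical scRNAseq counts (the random data and 90% sparsity here are made up for illustration). Note that np.percentile itself still needs a dense array, so only the filtered rows are densified; the memory win is in storing the matrix:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
dense = rng.random((1000, 50))
dense[dense < 0.9] = 0          # make it ~90% zeros, all entries >= 0

sp = sparse.csr_matrix(dense)   # stores only the nonzero entries

# Rows with at least one nonzero entry, without any dense boolean matrix:
# getnnz(axis=1) counts stored nonzeros per row.
row_mask = sp.getnnz(axis=1) > 0

# np.percentile needs dense input, so densify only the kept rows; the
# densified selection is a throwaway copy, so overwrite_input=True is safe.
perc = np.percentile(sp[row_mask].toarray(), 50, axis=0, overwrite_input=True)

sparse_bytes = sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes
print(sparse_bytes, "bytes (sparse) vs", dense.nbytes, "bytes (dense)")
```

The trade-off: CSR storage is cheap and row slicing is fast, but any step that requires a dense view (like the percentile itself) temporarily pays the dense cost for the rows involved.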
