Python lists/dictionaries vs. numpy arrays: performance vs. memory control

Problem Description

I have to read data files iteratively and store the data in (numpy) arrays. I chose to store the data in a dictionary of "data fields": {'field1': array1, 'field2': array2, ...}.

Case 1 (lists):

Using lists (or collections.deque()s) to "append" the new data arrays, the code is efficient. But when I concatenate the arrays stored in the lists, memory grows and I have not managed to free it again. Example:

import numpy as np

filename = 'test'
# data file with a matrix of shape (98, 56)
nFields = 56
# Initialize data dictionary and list of fields
dataDict = {}

# data dictionary: each entry contains a list
field_names = []
for i in xrange(nFields):
    field_names.append(repr(i))
    dataDict[repr(i)] = []

# Read a data file N times (it represents N files reading)
# file contains 56 fields of arbitrary length in the example
# Append each time the data fields to the lists (in the data dictionary)
N = 10000
for j in xrange(N):
    xy = np.loadtxt(filename)
    for i,field in enumerate(field_names):
        dataDict[field].append(xy[:,i])

# concatenate list members (arrays) to a numpy array 
for key,value in dataDict.iteritems():
    dataDict[key] = np.concatenate(value,axis=0)

Computing time: 63.4 s
Memory usage (top): 13862 gime_se 20 0 1042m 934m 4148 S 0 5.8 1:00.44 python

Case 2 (numpy arrays):

Concatenating the numpy arrays directly each time they are read is inefficient, but memory stays under control. Example:

import numpy as np

filename = 'test'  # same data file as in Case 1
nFields = 56
dataDict = {}
# data dictionary: each entry contains an array
field_names = []
for i in xrange(nFields):
    field_names.append(repr(i))
    dataDict[repr(i)] = np.array([])

# Read a data file N times (it represents N files reading)
# Concatenate data fields to numpy arrays (in the data dictionary)
N = 10000
for j in xrange(N):
    xy = np.loadtxt(filename)
    for i,field in enumerate(field_names):
        dataDict[field] = np.concatenate((dataDict[field],xy[:,i])) 

Computing time: 1377.8 s
Memory usage (top): 14850 gime_se 20 0 650m 542m 4144 S 0 3.4 22:31.21 python

Question(s):

  • Is there any way of having the performance of Case 1 but keeping the memory under control as in Case 2?

  • It seems that in Case 1, the memory grows when concatenating the list members (np.concatenate(value, axis=0)). Any better ideas for doing this?

Solution

Here's what is going on, based on what I've observed. There isn't really a memory leak. Instead, Python's memory management code (possibly in connection with the memory management of whatever OS you are on) is deciding to keep the space used by the original dictionary (the one without the concatenated arrays) in the program. However, that space is free to be reused. I proved this by doing the following:

  1. Making the code you gave as an answer into a function that returned dataDict.
  2. Calling the function twice and assigning the results to two different variables.

When I did this, I found that the amount of memory used only increased from ~900 MB to ~1.3 GB. Without the extra dictionary memory, the Numpy data itself should take up about 427 MB by my calculations, so this adds up. The second initial, unconcatenated dictionary that our function created simply used the already allocated memory.
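For anyone who wants to reproduce that check, here is a minimal sketch of the experiment (Python 3 syntax with range/items, whereas the original code uses Python 2's xrange/iteritems; the load_all() helper name is illustrative, not from the original post). It wraps the Case 1 code in a function, calls it twice, and keeps both results alive; watching the process RSS in top between the two calls shows the second dictionary largely reusing space the process already holds.

import numpy as np

def load_all(filename='test', n_fields=56, n_reads=10000):
    # Build the dict of concatenated arrays exactly as in Case 1.
    data = {repr(i): [] for i in range(n_fields)}
    for _ in range(n_reads):
        xy = np.loadtxt(filename)
        for i in range(n_fields):
            data[repr(i)].append(xy[:, i])
    for key, value in data.items():
        data[key] = np.concatenate(value, axis=0)
    return data

# Keep both results alive and compare the process RSS (e.g. in top)
# after each call: the second call mostly reuses already-allocated space.
d1 = load_all()
d2 = load_all()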

If you really can't use more than ~600 MB of memory, then I would recommend doing with your Numpy arrays something like what Python does internally with its lists: allocate an array with a certain number of columns, and when you've used those up, create an enlarged array with more columns and copy the data over. This will reduce the number of concatenations, meaning it will be faster (though still not as fast as lists), while keeping memory usage down. Of course, it is also more of a pain to implement.
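As a rough illustration of that strategy, here is a minimal sketch (Python 3 syntax; the GrowableArray name and its methods are illustrative, not from the original answer) of a 1-D buffer that over-allocates and doubles when full, so each field needs only a logarithmic number of reallocations instead of one concatenation per file:

import numpy as np

class GrowableArray(object):
    # 1-D buffer that over-allocates and doubles, like a Python list does internally.

    def __init__(self, capacity=1024, dtype=float):
        self._data = np.empty(capacity, dtype=dtype)
        self._size = 0

    def append(self, values):
        values = np.asarray(values, dtype=self._data.dtype)
        needed = self._size + values.size
        if needed > self._data.size:
            # Grow geometrically, then copy the existing data once.
            new_data = np.empty(max(needed, 2 * self._data.size),
                                dtype=self._data.dtype)
            new_data[:self._size] = self._data[:self._size]
            self._data = new_data
        self._data[self._size:needed] = values
        self._size = needed

    def finalize(self):
        # Trim to the part actually filled (copy() it if you will keep appending).
        return self._data[:self._size]

In the reading loop, each dataDict[field] would then be a GrowableArray: call append(xy[:, i]) on it for every file and replace it with finalize() once all files are read. Peak memory stays within roughly twice the final data size, while avoiding the per-file concatenation cost of Case 2.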
