Pythonic way to create a numpy array from a list of numpy arrays


Question


I generate a list of one-dimensional numpy arrays in a loop and later convert this list to a 2d numpy array. I would've preallocated a 2d numpy array if I knew the number of items ahead of time, but I don't, so I put everything in a list.

The mock up is below:

>>> list_of_arrays = map(lambda x: x*ones(2), range(5))
>>> list_of_arrays
[array([ 0.,  0.]), array([ 1.,  1.]), array([ 2.,  2.]), array([ 3.,  3.]), array([ 4.,  4.])]
>>> arr = array(list_of_arrays)
>>> arr
array([[ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 4.,  4.]])

My question is the following:

Is there a better way (performancewise) to go about the task of collecting sequential numerical data (in my case numpy arrays) than putting them in a list and then making a numpy.array out of it (I am creating a new obj and copying the data)? Is there an "expandable" matrix data structure available in a well tested module?

A typical size of my 2d matrix would be between 100x10 and 5000x10 floats

EDIT: In this example I'm using map, but in my actual application I have a for loop.
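For reference, a for-loop version of the same mock-up (a sketch of the pattern described in the question, using the toy data from the transcript above) could look like:

```python
import numpy as np

rows = []                        # length not known ahead of time in the real code
for x in range(5):
    rows.append(x * np.ones(2))  # collect 1-d rows as they are produced

arr = np.array(rows)             # copies the rows into one new 2-d array
```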

Solution

Suppose you know that the final array arr will never be larger than 5000x10. Then you could pre-allocate an array of maximum size, populate it with data as you go through the loop, and then use arr.resize to cut it down to the discovered size after exiting the loop.

The tests below suggest doing so will be slightly faster than constructing intermediate python lists no matter what the ultimate size of the array is.

Also, arr.resize de-allocates the unused memory, so the final (though maybe not the intermediate) memory footprint is smaller than what is used by python_lists_to_array.
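As a tiny illustration of that trimming step (a hedged sketch separate from the benchmark code; the sizes are the ones assumed in this answer):

```python
import numpy as np

arr = np.empty((5000, 10))   # pre-allocate the assumed maximum size
for x in range(100):         # only 100 rows actually get written
    arr[x] = x
arr.resize((100, 10))        # shrink in place; the unused tail is freed
```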

This shows numpy_all_the_way is faster:

% python -mtimeit -s"import test" "test.numpy_all_the_way(100)"
100 loops, best of 3: 1.78 msec per loop
% python -mtimeit -s"import test" "test.numpy_all_the_way(1000)"
100 loops, best of 3: 18.1 msec per loop
% python -mtimeit -s"import test" "test.numpy_all_the_way(5000)"
10 loops, best of 3: 90.4 msec per loop

% python -mtimeit -s"import test" "test.python_lists_to_array(100)"
1000 loops, best of 3: 1.97 msec per loop
% python -mtimeit -s"import test" "test.python_lists_to_array(1000)"
10 loops, best of 3: 20.3 msec per loop
% python -mtimeit -s"import test" "test.python_lists_to_array(5000)"
10 loops, best of 3: 101 msec per loop

This shows numpy_all_the_way uses less memory:

% test.py
Initial memory usage: 19788
After python_lists_to_array: 20976
After numpy_all_the_way: 20348

test.py:

import numpy as np
import os


def memory_usage():
    pid = os.getpid()
    return next(line for line in open('/proc/%s/status' % pid).read().splitlines()
                if line.startswith('VmSize')).split()[-2]

N, M = 5000, 10


def python_lists_to_array(k):
    # Build k rows in a Python list, then copy them into a new 2-d array.
    list_of_arrays = list(map(lambda x: x * np.ones(M), range(k)))
    arr = np.array(list_of_arrays)
    return arr


def numpy_all_the_way(k):
    # Pre-allocate the maximum size, fill k rows, then trim with resize.
    arr = np.empty((N, M))
    for x in range(k):
        arr[x] = x * np.ones(M)
    arr.resize((k, M))
    return arr

if __name__ == '__main__':
    print('Initial memory usage: %s' % memory_usage())
    arr = python_lists_to_array(5000)
    print('After python_lists_to_array: %s' % memory_usage())
    arr = numpy_all_the_way(5000)
    print('After numpy_all_the_way: %s' % memory_usage())
