How can I append to a numpy array without reassigning the result to a new variable?


Problem description

I have a matrix M with dimensions (m, n), and I need to append new columns to it from a matrix L with dimensions (m, l). So I end up with a matrix of shape (m, n + l).

Doing this once is no problem: I can call a NumPy function in the form np.command(M, L), and it returns a new matrix. The problem is that I need to append a great many matrices to the original one, and the sizes of these matrices L are not known beforehand.

So I ended up with:

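import numpy as np

# M is my original matrix, shape (m, n)
while True:
    # find out my L matrix, shape (m, l)
    M = np.append(M, L, axis=1)  # axis=1 appends columns; without axis, np.append flattens both arrays
    # check whether another matrix needs to be appended; if not, break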

My matrix M has approximately 100k rows, and I add about 5k columns on average, so the process is extremely slow and takes more than a couple of hours (I don't know exactly how long, because I gave up after 2 hours).

The problem is clearly in the append call (I tried vstack and nothing changed). If I only calculate the matrices L, without appending them, the task takes less than 10 minutes. I assume that repeatedly reassigning the matrix is what makes it slow. Intuitively this makes sense, because I am constantly recreating the matrix M and discarding the old one. But I do not know how to get rid of the reassignment.

One idea is to create an empty matrix beforehand and then populate it with the correct columns, which should be faster. The problem is that I do not know what dimensions to create it with (there is no way to predict the number of columns in my matrix).

So how can I improve performance here?

Recommended answer

There's no way to append to an existing numpy array without creating a copy.

The reason is that a numpy array must be backed by a contiguous block of memory. If I create a (1000, 10) array and then decide that I want to append another row, I'd need to be able to extend the chunk of RAM corresponding to the array so that it's big enough to hold a (1001, 10) array. In the general case this is impossible, since the adjacent memory addresses may already be allocated to other objects.
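As a small illustration of the "contiguous block" point, the sketch below inspects a freshly created array. The variable names are only for illustration, and the byte counts assume NumPy's default float64 dtype (8 bytes per element).

```python
import numpy as np

a = np.zeros((1000, 10))  # float64 by default: 8 bytes per element

# A freshly created array is one unbroken, row-major block of memory.
is_contiguous = a.flags["C_CONTIGUOUS"]

# Strides are the number of bytes to step to reach the next row / next column.
row_stride, col_stride = a.strides

# Total size of the block: 1000 * 10 * 8 bytes.
total_bytes = a.nbytes

print(is_contiguous, row_stride, col_stride, total_bytes)
```

Because every row sits at a fixed byte offset from the start of the block, there is no room to slot in extra data unless the memory immediately after the block happens to be free.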

The only way to 'concatenate' arrays is to get the OS to allocate another chunk of memory big enough for the new array, then copy the contents of the original array and the new row into this space. This is obviously very inefficient if you're doing it repeatedly in a loop, especially since the copying step becomes more and more expensive as your array gets larger.
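The copy can be observed directly. The following is a minimal sketch with made-up small shapes: it checks that the array returned by np.append does not share memory with the original, and that the original is left unchanged.

```python
import numpy as np

M = np.zeros((4, 3))
L = np.ones((4, 2))

# axis=1 appends columns and returns a brand-new array.
M_new = np.append(M, L, axis=1)

# The result lives in a different block of memory than M.
shares = np.shares_memory(M, M_new)

# Without an axis argument, np.append flattens both inputs into a 1-D array.
flat = np.append(M, L)

print(M.shape, M_new.shape, shares, flat.shape)
```

Each call therefore copies everything accumulated so far, which is why the cost per iteration keeps growing inside a loop.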

Here are two possible workarounds:

  1. Use a standard Python list to accumulate your rows inside your while loop, then convert the list to an array in a single step, outside the loop. Appending to a Python list is very cheap compared with concatenating numpy arrays, since a list is just an array of pointers that don't have to reference adjacent memory addresses, so the data itself is never copied.

  2. Take an educated guess at the number of rows in your final array, then allocate a numpy array that's slightly bigger and fill in the rows as you go along. If you run out of space, concatenate on another chunk of rows. Obviously the concatenation step is expensive, since you'll need to make a copy, but you're much better off doing this once or twice than on every iteration of your loop. When you choose the initial number of rows in your output array, there is a trade-off between over-allocating and incurring unnecessary concatenation steps. Once you're done, you can 'trim off' any unused rows using slice indexing.
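Both workarounds can be sketched as follows, adapted to the question's case of appending columns rather than rows. This is an illustration under stated assumptions, not a definitive implementation: append_via_list and ColumnBuffer are hypothetical names, the random blocks stand in for the matrices L, and doubling the capacity when space runs out is one possible choice for how large the extra chunk should be.

```python
import numpy as np


def append_via_list(blocks):
    """Workaround 1: collect blocks in a list, then concatenate once."""
    collected = []
    for block in blocks:
        collected.append(block)  # cheap: only a reference is stored
    return np.concatenate(collected, axis=1)  # single copy at the end


class ColumnBuffer:
    """Workaround 2: preallocate spare columns and grow only when full."""

    def __init__(self, n_rows, capacity, dtype=float):
        self._data = np.empty((n_rows, capacity), dtype=dtype)
        self._used = 0  # number of columns filled so far

    def append(self, block):
        needed = self._used + block.shape[1]
        if needed > self._data.shape[1]:
            # Out of space: copy once into a buffer at least twice as wide
            # (the doubling factor is an assumption, not from the answer).
            new_capacity = max(needed, 2 * self._data.shape[1])
            grown = np.empty((self._data.shape[0], new_capacity),
                             dtype=self._data.dtype)
            grown[:, :self._used] = self._data[:, :self._used]
            self._data = grown
        self._data[:, self._used:needed] = block
        self._used = needed

    def result(self):
        # Trim off the unused columns with slice indexing (returns a view).
        return self._data[:, :self._used]


# Stand-in data: four blocks with 5 rows and varying column counts.
rng = np.random.default_rng(0)
blocks = [rng.random((5, width)) for width in (2, 3, 1, 4)]

via_list = append_via_list(blocks)

buf = ColumnBuffer(n_rows=5, capacity=4)
for block in blocks:
    buf.append(block)
via_buffer = buf.result()

print(via_list.shape, via_buffer.shape)
```

The list version is usually the simpler choice when all blocks fit in memory at once. The buffer version avoids holding a second full copy at the end, at the cost of occasionally copying when the capacity guess turns out too small.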
