MemoryError while creating cartesian product in Numpy


Problem description

I have 3 numpy arrays and need to form the cartesian product between them. The dimensions of the arrays are not fixed, so they can take different values; one example could be A=(10000, 50), B=(40, 50), C=(10000, 50).

Then, I perform some processing on the product (like a + b - c). Below is the function that I am using to form it.

import numpy as np

def cartesian_2d(arrays, out=None):
    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    # total number of row combinations
    n = np.prod([x.shape[0] for x in arrays])
    if out is None:
        out = np.empty([n, len(arrays), arrays[0].shape[1]], dtype=dtype)

    # fill the first column, then recurse to fill the remaining columns
    m = n // arrays[0].shape[0]
    out[:, 0] = np.repeat(arrays[0], m, axis=0)
    if arrays[1:]:
        cartesian_2d(arrays[1:], out=out[0:m, 1:, :])
        for j in range(1, arrays[0].shape[0]):
            out[j * m:(j + 1) * m, 1:] = out[0:m, 1:]
    return out

a = np.array([[0, -0.02], [1, -0.15]])
b = np.array([[0, 0.03]])

result = cartesian_2d([a, b, a])

# array([[[ 0.  , -0.02],
#         [ 0.  ,  0.03],
#         [ 0.  , -0.02]],
#
#        [[ 0.  , -0.02],
#         [ 0.  ,  0.03],
#         [ 1.  , -0.15]],
#
#        [[ 1.  , -0.15],
#         [ 0.  ,  0.03],
#         [ 0.  , -0.02]],
#
#        [[ 1.  , -0.15],
#         [ 0.  ,  0.03],
#         [ 1.  , -0.15]]])

The output is the same as with itertools.product. However, I am using my custom function to take advantage of numpy vectorized operations, which works well compared to itertools.product in my case.

After this, I do

result[:, 0, :] + result[:, 1, :] - result[:, 2, :]

# array([[ 0.  ,  0.03],
#        [-1.  ,  0.16],
#        [ 1.  , -0.1 ],
#        [ 0.  ,  0.03]])

This is the final expected result.

The function works as expected as long as my arrays fit in memory. But my use case requires me to work with huge data, and I get a MemoryError at the np.empty() line since it is unable to allocate the memory required. I am working with circa 20 GB of data at the moment, and this might increase in the future.
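For a sense of scale, a rough back-of-the-envelope estimate with the example shapes A=(10000, 50), B=(40, 50), C=(10000, 50) (illustrative numbers only):

# illustrative estimate only, using the example shapes from above
n = 10000 * 40 * 10000            # number of row combinations: 4e9
product_bytes = n * 3 * 50 * 8    # out has shape (n, 3, 50) in float64: ~4.8 TB
result_bytes = n * 50 * 8         # even the final (n, 50) result: ~1.6 TB
print(product_bytes / 1e12, result_bytes / 1e12)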

These arrays represent vectors and will have to be stored as floats, so I cannot use int. Also, they are dense arrays, so using sparse is not an option.

I will be using these arrays for further processing, and ideally I would not like to store them in files at this stage, so memmap / h5py formats may not help, although I am not sure of this.

If there are other ways to form this product, that would be okay too.

As I am sure there are applications with way larger datasets than this, I hope someone has encountered such issues before and can suggest how to handle them. Please help.

Recommended answer

If at least your result fits in memory

The following produces your expected result without relying on an intermediate array three times the size of the result. It uses broadcasting.

Please note that almost any NumPy operation is broadcastable like this, so in practice there is probably no need for an explicit cartesian product:

# a and b are the NumPy arrays from the question
# shared dimensions:
sh = a.shape[1:]
aba = (a[:, None, None] + b[None, :, None] - a[None, None, :]).reshape(-1, *sh)
aba
# array([[ 0.  ,  0.03],
#        [-1.  ,  0.16],
#        [ 1.  , -0.1 ],
#        [ 0.  ,  0.03]])

Addressing result rows by 'ID'

You may consider leaving out the reshape. That would allow you to address the rows in the result by combined index. If your component IDs are just 0, 1, 2, ... like in your example, this would be the same as the combined ID. For example, aba[1, 0, 0] would correspond to the row obtained as the second row of a + the first row of b - the first row of a.
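A minimal sketch of this indexing, reusing the small a and b from the example above:

import numpy as np

a = np.array([[0, -0.02], [1, -0.15]])
b = np.array([[0, 0.03]])

# keep the broadcast result un-reshaped: shape (len(a), len(b), len(a), 2)
aba = a[:, None, None] + b[None, :, None] - a[None, None, :]

# aba[i, j, k] is row i of a + row j of b - row k of a
print(aba[1, 0, 0])   # [ 1.   -0.1 ] : second row of a + first row of b - first row of a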

Broadcasting: when, for example, adding two arrays, their shapes do not have to be identical, only compatible because of broadcasting. Broadcasting is in a sense a generalization of adding scalars to arrays:

    [[2],                 [[7],   [[2],
7 +  [3],     equiv to     [7], +  [3],
     [4]]                  [7]]    [4]]

Broadcasting:

              [[4],            [[1, 2, 3],   [[4, 4, 4],
[[1, 2, 3]] +  [5],  equiv to   [1, 2, 3], +  [5, 5, 5],
               [6]]             [1, 2, 3]]    [6, 6, 6]]

For this to work, each dimension of each operand must either be 1 or equal to the corresponding dimension in every other operand (unless that dimension is 1). If an operand has fewer dimensions than the others, its shape is padded with ones on the left. Note that the equiv arrays shown in the illustration are not explicitly created.
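A quick way to check these rules in practice (the shapes here are chosen purely for illustration):

import numpy as np

x = np.ones((1, 2, 3))
y = np.ones((4, 1, 3))     # every dimension is either 1 or matches x
print((x + y).shape)       # (4, 2, 3)

z = np.ones((3,))          # fewer dimensions: padded on the left to (1, 1, 3)
print((x + z).shape)       # (1, 2, 3)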

In that case, i.e. if even the result does not fit in memory, I don't see how you can avoid using storage, so h5py or something like that is the way to go.
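A minimal sketch of that approach (the shapes, file name and chunking scheme here are only assumptions): allocate the full dataset on disk with h5py and fill it one slice of a at a time, so only one block ever lives in memory.

import numpy as np
import h5py

# hypothetical input sizes, just for illustration
a = np.random.rand(1000, 50)
b = np.random.rand(40, 50)

n = len(a) * len(b) * len(a)          # total number of combinations
block = len(b) * len(a)               # rows produced per row of a

with h5py.File('aba.h5', 'w') as f:
    out = f.create_dataset('aba', shape=(n, a.shape[1]), dtype=a.dtype)
    for i in range(len(a)):
        # one broadcast block: a[i] + every row of b - every row of a
        chunk = (a[i, None, None] + b[:, None] - a[None, :]).reshape(-1, a.shape[1])
        out[i * block:(i + 1) * block] = chunk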

Removing the IDs is then just a matter of slicing:

a_no_id = a[:, 1:]

etc. Note that, unlike Python lists, NumPy arrays do not return a copy when sliced, but a view. Therefore efficiency (memory or runtime) is not an issue here.
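A tiny check of that (the ID column in this array is hypothetical):

import numpy as np

a = np.array([[0, 0.0, -0.02],        # first column used as an ID
              [1, 1.0, -0.15]])

a_no_id = a[:, 1:]                    # drop the ID column
print(np.shares_memory(a, a_no_id))   # True: the slice is a view, not a copy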
