Split numpy array into chunks by maximum size

Question

I have some very large two-dimensional numpy arrays. One data set is 55732 by 257659, which is over 14 billion elements. Because some operations I need to perform throw MemoryErrors, I would like to try splitting the array up into chunks of a certain size and running the operations against the chunks. (I can aggregate the results after the operation runs on each piece.) The fact that my problem is MemoryErrors means that it's important that I can cap the size of the arrays somehow, rather than split them into a constant number of pieces.

For an example, let's generate a 1009 by 1009 random array:

a = numpy.random.choice([1,2,3,4], (1009,1009))

My data isn't necessarily possible to split evenly, and it's definitely not guaranteed to be splittable by the size I want. So I've chosen 1009 because it's prime.

Let's also say I want them in chunks of no larger than 50 by 50. Since this is just to avoid errors with extremely large arrays, it's okay if the result isn't exact.

How can I split this into the desired chunks?

I'm using Python 3.6 64-bit with numpy 1.14.3 (latest).

I have seen this function that uses reshape, but it doesn't work if the number of rows and columns isn't exactly divisible by the chunk size.
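
(Just to illustrate the divisibility problem with a reshape-based blocking step, not the linked function itself: reshaping into fixed 50-sized blocks simply raises an error when the dimensions don't divide evenly.)

a.reshape(1009 // 50, 50, 1009 // 50, 50)
# ValueError: cannot reshape array of size 1018081 into shape (20,50,20,50)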

This question (among other similar ones) has answers explaining how to split into a certain number of chunks, but this does not explain how to split into a certain size.

I also saw this question, since it's actually my exact problem. The answers and comments suggest switching to 64-bit (which I already have) and using numpy.memmap. Neither helped.

Solution

This can be done so that the resulting arrays have shapes slightly less than the desired maximum or so that they have exactly the desired maximum except for some remainder at the end.

The basic logic is to compute the parameters for splitting the array and then use array_split to split the array along each axis (or dimension) of the array.
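
For reference, array_split can be called in two ways, and both appear below: with a number of sections (slightly unequal pieces are allowed), or with a list of indices to split at. A minimal standalone illustration:

import numpy

x = numpy.arange(10)
print(numpy.array_split(x, 3))
# [array([0, 1, 2, 3]), array([4, 5, 6]), array([7, 8, 9])]
print(numpy.array_split(x, [4, 8]))
# [array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([8, 9])]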

We'll need the numpy and math modules and the example array:

import math
import numpy

a = numpy.random.choice([1,2,3,4], (1009,1009))

Slightly less than max

The logic

First, store the desired maximum chunk shape along each dimension you want to split in a tuple:

chunk_shape = (50, 50)

array_split only splits along one axis (or dimension) of an array at a time. So let's start with just the first axis.

  1. Compute the number of sections we need to split the array into:

num_sections = math.ceil(a.shape[0] / chunk_shape[0])

In our example case, this is 21 (1009 / 50 = 20.18).

Now split it:

first_split = numpy.array_split(a, num_sections, axis=0)

This gives us a list of 21 (the number of requested sections) numpy arrays that are split so they are no larger than 50 in the first dimension:

print(len(first_split))
# 21
print({i.shape for i in first_split})
# {(48, 1009), (49, 1009)}
# These are the distinct shapes, so we don't see all 21 separately

In this case, they're 48 and 49 along that axis.

We can do the same thing to each new array for the second dimension:

num_sections = math.ceil(a.shape[1] / chunk_shape[1])
second_split = [numpy.array_split(a2, num_sections, axis=1) for a2 in first_split]

This gives us a list of lists. Each sublist contains numpy arrays of the size we wanted:

print(len(second_split))
# 21
print({len(i) for i in second_split})
# {21}
# All sublists are 21 long
print({i2.shape for i in second_split for i2 in i})
# {(48, 49), (49, 48), (48, 48), (49, 49)}
# Distinct shapes

The full function

We can implement this for arbitrary dimensions using a recursive function:

def split_to_approx_shape(a, chunk_shape, start_axis=0):
    if len(chunk_shape) != len(a.shape):
        raise ValueError('chunk length does not match array number of axes')

    # Every axis has been split; this piece is final.
    if start_axis == len(a.shape):
        return a

    # Split the current axis into enough sections that no piece exceeds
    # the requested chunk size, then recurse into each piece for the
    # remaining axes.
    num_sections = math.ceil(a.shape[start_axis] / chunk_shape[start_axis])
    split = numpy.array_split(a, num_sections, axis=start_axis)
    return [split_to_approx_shape(split_a, chunk_shape, start_axis + 1) for split_a in split]

And we call it like this:

full_split = split_to_approx_shape(a, (50,50))
print({i2.shape for i in full_split for i2 in i})
# {(48, 49), (49, 48), (48, 48), (49, 49)}
# Distinct shapes
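
As a quick sanity check (a small verification sketch, not part of the splitting logic itself), the nested list of chunks can be stitched back together with numpy.block, which accepts exactly this list-of-lists layout:

reassembled = numpy.block(full_split)
print(reassembled.shape)
# (1009, 1009)
print(numpy.array_equal(reassembled, a))
# True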

Exact shapes plus remainder

The logic

If we want to be a little fancier and have all the new arrays be exactly the specified size except for a trailing leftover array, we can do that by passing array_split a list of indices to split at.

  1. First build up the array of indices:

axis = 0
split_indices = [chunk_shape[axis]*(i+1) for i in range(math.floor(a.shape[axis] / chunk_shape[axis]))]

This gives us a list of indices, each 50 from the last:

print(split_indices)
# [50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000]

  2. Then split:

first_split = numpy.array_split(a, split_indices, axis=0)
print(len(first_split))
# 21
print({i.shape for i in first_split})
# {(9, 1009), (50, 1009)}
# Distinct shapes, so we don't see all 21 separately
print((first_split[0].shape, first_split[1].shape, '...', first_split[-2].shape, first_split[-1].shape))
# ((50, 1009), (50, 1009), '...', (50, 1009), (9, 1009))

  3. And then again for the second axis:

axis = 1
split_indices = [chunk_shape[axis]*(i+1) for i in range(math.floor(a.shape[axis] / chunk_shape[axis]))]
second_split = [numpy.array_split(a2, split_indices, axis=1) for a2 in first_split]
print({i2.shape for i in second_split for i2 in i})
# {(9, 50), (9, 9), (50, 9), (50, 50)}

The full function

Adapting the recursive function:

def split_to_shape(a, chunk_shape, start_axis=0):
    if len(chunk_shape) != len(a.shape):
        raise ValueError('chunk length does not match array number of axes')

    # Every axis has been split; this piece is final.
    if start_axis == len(a.shape):
        return a

    # Split at exact multiples of the chunk size along the current axis
    # (any remainder becomes the final, smaller piece), then recurse.
    split_indices = [
        chunk_shape[start_axis]*(i+1)
        for i in range(math.floor(a.shape[start_axis] / chunk_shape[start_axis]))
    ]
    split = numpy.array_split(a, split_indices, axis=start_axis)
    return [split_to_shape(split_a, chunk_shape, start_axis + 1) for split_a in split]

And we call it exactly the same way:

full_split = split_to_shape(a, (50,50))
print({i2.shape for i in full_split for i2 in i})
# {(9, 50), (9, 9), (50, 9), (50, 50)}
# Distinct shapes

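To tie this back to the original goal of running an operation per chunk and aggregating the results afterwards, here is a minimal sketch of my own, with numpy's sum standing in for whatever memory-hungry operation you actually need:

chunk_results = [piece.sum() for row in full_split for piece in row]
print(sum(chunk_results) == a.sum())
# True
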
Extra Notes

Performance

These functions seem to be quite fast. I was able to split up my example array (with over 14 billion elements) into 1000 by 1000 shaped pieces (resulting in over 14000 new arrays) in under 0.05 seconds with either function:

import timeit

print('Building test array')
a = numpy.random.randint(4, size=(55000, 250000), dtype='uint8')
chunks = (1000, 1000)
numtests = 1000
print('Running {} tests'.format(numtests))
print('split_to_approx_shape: {} seconds'.format(timeit.timeit(lambda: split_to_approx_shape(a, chunks), number=numtests) / numtests))
print('split_to_shape: {} seconds'.format(timeit.timeit(lambda: split_to_shape(a, chunks), number=numtests) / numtests))

Output:

Building test array
Running 1000 tests
split_to_approx_shape: 0.035109398348040485 seconds
split_to_shape: 0.03113800323300747 seconds

I did not test speed with higher dimension arrays.

These functions both work properly if the size of any dimension is less than the specified maximum. This requires no special logic.
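
For example, a quick check of my own (illustrative only), using a fresh 1009 by 1009 array and a first-axis chunk size larger than the array itself, simply leaves that axis unsplit:

small = numpy.random.choice([1, 2, 3, 4], (1009, 1009))
full_split = split_to_shape(small, (2000, 50))
print({i2.shape for i in full_split for i2 in i})
# {(1009, 50), (1009, 9)}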
