The accessing time of a numpy array is impacted much more by the last index compared to the second last


Problem description

This is a follow-up to this answer to my previous question Fastest approach to read thousands of images into one big numpy array (stackoverflow.com/questions/44078327).


In chapter 2.3 "Memory allocation of the ndarray", Travis Oliphant writes the following regarding how indexes are accessed in memory for C-ordered numpy arrays.


...to move through computer memory sequentially, the last index is incremented first, followed by the second-to-last index and so forth.


This can be confirmed by benchmarking the accessing time of 2-D arrays either along the two first or the two last indexes (for my purposes, this is a simulation of loading 500 images of size 512x512 pixels):

import numpy as np

N = 512
n = 500
a = np.random.randint(0,255,(N,N))

def last_and_second_last():
    '''Store along the two last indexes'''
    imgs = np.empty((n,N,N), dtype='uint16')
    for num in range(n):
        imgs[num,:,:] = a
    return imgs

def second_and_third_last():
    '''Store along the two first indexes'''
    imgs = np.empty((N,N,n), dtype='uint16')
    for num in range(n):
        imgs[:,:,num] = a
    return imgs

Benchmark

In [2]: %timeit last_and_second_last()
136 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit second_and_third_last()
1.56 s ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


So far so good. However, when I load arrays along the last and third last dimension, this is almost as fast as loading them into the two last dimensions.

def last_and_third_last():
    '''Store along the last and first indexes'''
    imgs = np.empty((N,n,N), dtype='uint16')
    for num in range(n):    
        imgs[:,num,:] = a
    return imgs

Benchmark

In [4]: %timeit last_and_third_last()
149 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

  • Why is last_and_third_last() so much closer in speed to last_and_second_last() than to second_and_third_last()?
  • What's a good way to visualize why the last index matters so much more than the second-to-last index for access speed?
Answer


      I'll try to illustrate the indexing, without getting into details of processor caching etc.


      Let's make a small 3d array with distinctive element values:

      In [473]: X = np.mgrid[100:300:100,10:30:10,1:4:1].sum(axis=0)
      In [474]: X
      Out[474]: 
      array([[[111, 112, 113],
              [121, 122, 123]],
      
             [[211, 212, 213],
              [221, 222, 223]]])
      In [475]: X.shape
      Out[475]: (2, 2, 3)
      


      ravel views it as a 1d array, and shows us how the values are laid out in memory. (This, incidentally, is with the default C ordering.)

      In [476]: X.ravel()
      Out[476]: array([111, 112, 113, 121, 122, 123, 211, 212, 213, 221, 222, 223])
      
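The same layout can also be read off numerically from the array's strides. A minimal sketch (the array here uses an explicitly sized dtype so the byte counts are predictable, rather than the X from the answer):

```python
import numpy as np

# A 3-D C-ordered array with a fixed 8-byte dtype so strides are predictable.
X = np.arange(2 * 2 * 3, dtype='int64').reshape(2, 2, 3)

# strides = bytes to step in memory when each index is incremented by 1.
# The last index has the smallest step (one element), so incrementing it
# walks memory sequentially; the first index jumps a whole 2x3 plane.
print(X.strides)  # (48, 24, 8): 8 bytes/element, 24 bytes/row, 48 bytes/plane
```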


      When I index on the 1st dimension I get 2*3 values, a contiguous block of the above list:

      In [477]: X[0,:,:].ravel()
      Out[477]: array([111, 112, 113, 121, 122, 123])
      


      Indexing instead on the last index gives 4 values, selected from across the array; I've added .. to highlight that:

      In [478]: X[:,:,0].ravel()
      Out[478]: array([111,.. 121,.. 211,.. 221])
      


      Indexing on the middle gives me 2 contiguous subblocks, i.e. 2 rows of X.

      In [479]: X[:,0,:].ravel()
      Out[479]: array([111, 112, 113,.. 211, 212, 213])
      


      With the strides and shape calculations numpy can access any one element in X in (about) the same time. And in the X[:,:,i] case that's what it has to do: the 4 values are 'scattered' across the data buffer.


      But if it can access contiguous blocks, such as in X[i,:,:], it can delegate more of the action to low level compiled and processor code. With X[:,i,:] those blocks aren't quite as big, but may still be big enough to make a big difference.


      In your test case, [n,:,:] iterates 500 times on 512*512 element blocks.


      [:,n,:] has to divide that access into 512 blocks of 512 each.


      [:,:,n] has to do 500 x 512 x 512 individual accesses.
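One way to see which of these slice patterns yields a contiguous block is to check the flags of each view; a small sketch (reusing a 2x2x3 array shaped like the answer's X):

```python
import numpy as np

X = np.arange(2 * 2 * 3, dtype='int64').reshape(2, 2, 3)

# Only a slice on the first index is one unbroken run of memory; the other
# two views have gaps between their elements, so numpy can't hand them to
# low-level code as a single block copy.
print(X[0, :, :].flags['C_CONTIGUOUS'])  # True: one contiguous 2x3 block
print(X[:, 0, :].flags['C_CONTIGUOUS'])  # False: two separated rows
print(X[:, :, 0].flags['C_CONTIGUOUS'])  # False: 4 scattered elements
```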


      I wonder if working with uint16 exaggerates the effect. In another question we just showed that calculation with float16 is much slower (up to 10x) because the processor (and compiler) is tuned to work with 32 and 64 bit numbers. If the processor is tuned to moving blocks of 64bit numbers around, then moving an isolated 16 bit number could require a lot of extra processing. It would be like doing a copy-n-paste from a document word-by-word, when copying line-by-line requires fewer key strokes per copy.


      The exact details are buried in the processor, operating system and compiler, as well as numpy code, but hopefully this gives a feel for why your middle case falls much closer to the optimum than to the worst case.


      On testing: setting the imgs dtype to a.dtype slows things down a bit in all cases, so the 'uint16' isn't causing any special problems.
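That dtype remark can be illustrated directly: when the destination array's dtype differs from the source's, every block copy also performs an element-wise cast, whereas a matching dtype is a plain copy. A small sketch (shapes scaled down from the question's 500x512x512):

```python
import numpy as np

N, n = 8, 4
a = np.random.randint(0, 255, (N, N))  # default integer dtype, e.g. int64

# Destination with a different dtype: each assignment copies AND casts.
imgs_cast = np.empty((n, N, N), dtype='uint16')
# Destination matching the source dtype: a straight copy, no conversion.
imgs_same = np.empty((n, N, N), dtype=a.dtype)

for num in range(n):
    imgs_cast[num, :, :] = a  # cast to uint16 on every copy
    imgs_same[num, :, :] = a  # plain copy

print(imgs_cast.dtype, imgs_same.dtype)  # uint16 vs. a.dtype
```

The values 0..254 fit in uint16, so both destinations hold the same numbers; only the per-element conversion work differs.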

      Related: Why does `numpy.einsum` work faster with `float32` than with `float16` or `uint16`?

