Accumulate sliding windows relative to origin


Problem description

I have an array A with the shape (3,3), which can be thought of as the sliding window view of an unknown array with the shape (5,). I want to compute the inverse of windowing the array with the shape (5,). The adjoint operation of this will be summation: I want to accumulate the values in each window into the corresponding positions of the array with the shape (5,). Of course, the expected output of this inverse function and the input A are not related and are just ordinary arrays. I have two examples which I hope explain this better.

A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 1]], dtype=np.float32)

I expect this output:

np.array([0, 0, 1, 1, 1])

The other example:

A = np.array([[1, 2, 3],
              [2, 3, 4],
              [3, 4, 5]], dtype=np.float32)

I expect this output:

np.array([1, 2+2, 3+3+3, 4+4, 5]) = np.array([1, 4, 9, 8, 5])

The solution I have is quite slow (the result is stored in out):

out = np.zeros(5, dtype=np.float32)
# Overlapping writable view: row i aliases out[i:i+3]; strides are in bytes (4 per float32)
windows = np.lib.stride_tricks.as_strided(out, shape=(3,3), strides=(4,4))
for i in np.ndindex(windows.shape):
  windows[i] += A[i]

Writing to a strided view feels a bit hacky and I am sure there is a better solution.

Is there any way to write this in a vectorized manner, without the for loop, that also generalizes to multiple dimensions?
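For the 1d case, one possible loop-free formulation is sketched below. It is not taken from the thread: it assumes that element A[i, j] belongs to position i + j of the underlying array, and the function name accumulate_windows_1d is made up for illustration. np.bincount with weights then sums all values that share a position.

```python
import numpy as np

def accumulate_windows_1d(A):
    """Sum every window row of A back into the 1-D array it was taken from."""
    n_windows, window_size = A.shape
    out_len = n_windows + window_size - 1
    # Assumption: A[i, j] corresponds to position i + j of the underlying array.
    positions = np.add.outer(np.arange(n_windows), np.arange(window_size))
    # bincount adds up all weights that share the same position (result is float64).
    return np.bincount(positions.ravel(), weights=A.ravel(), minlength=out_len)

A = np.array([[1, 2, 3],
              [2, 3, 4],
              [3, 4, 5]], dtype=np.float32)
print(accumulate_windows_1d(A))  # [1. 4. 9. 8. 5.]
```

Whether this beats a simple loop over rows depends on the array size, so it is worth timing both on real data.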

EDIT

In terms of generalizing to higher dimensions, I have cases where the windows are taken from an image (2d array) instead of a 1d array like in the example above. In the 2d case, A can for example hold windows of size 3. This means that for an image (the output) with the shape (4,4), the windows A will have the shape (2,2,3,3).

A = np.array([[[[0, 0, 0],
                [0, 1, 0],
                [0, 0, 0]],

               [[0, 0, 0],
                [1, 0, 0],
                [0, 0, 0]]],


              [[[0, 1, 0],
                [0, 0, 0],
                [0, 0, 0]],

               [[1, 0, 0],
                [0, 0, 0],
                [0, 0, 0]]]], dtype=np.float32)

Using the solution given by Pablo, I get the following error:

value array of shape (2,2,3,3) could not be broadcast to indexing result of shape (2,2)

Using a slightly modified version of my stride solution:

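def inverse_sliding_windows(A, window_sz, image_sz):
  out = np.zeros(image_sz, dtype=np.float32)
  # Writable view: each window aliases the matching region of out
  windows = np.lib.stride_tricks.sliding_window_view(out, window_sz, writeable=True)
  for i in np.ndindex(windows.shape):
    windows[i] += A[i]
  return out  # return the accumulated image; it was missing in the original snippet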

window_sz = (3,3)
image_sz = (4,4)
inverse_sliding_windows(A, window_sz, image_sz)

Output:

array([[0., 0., 0., 0.],
       [0., 4., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]], dtype=float32)

To clarify, the window size and output shape are known beforehand; see inverse_sliding_windows.
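A loop-free version of the 2d case can be sketched with np.add.at, which performs unbuffered in-place addition and therefore accumulates correctly when several windows hit the same pixel. This is only an illustration, not a method from the thread: the function name accumulate_windows_2d is hypothetical, and the output shape is assumed to be (number of windows + window size - 1) along each axis.

```python
import numpy as np

def accumulate_windows_2d(A):
    """Scatter-add windows of shape (h, w, x, y) back into an image."""
    h, w, x, y = A.shape
    # Assumption: stride-1 windows, so the image has h + x - 1 rows and w + y - 1 columns.
    out = np.zeros((h + x - 1, w + y - 1), dtype=A.dtype)
    # A[i, j, a, b] belongs to pixel (i + a, j + b); build both index arrays by broadcasting.
    rows = np.arange(h)[:, None, None, None] + np.arange(x)[None, None, :, None]
    cols = np.arange(w)[None, :, None, None] + np.arange(y)[None, None, None, :]
    # Unbuffered add: repeated (row, col) pairs are all accumulated.
    np.add.at(out, (rows, cols), A)
    return out

# The 2d example from the question, built by setting the four non-zero entries.
A = np.zeros((2, 2, 3, 3), dtype=np.float32)
A[0, 0, 1, 1] = A[0, 1, 1, 0] = A[1, 0, 0, 1] = A[1, 1, 0, 0] = 1
print(accumulate_windows_2d(A))
```

np.add.at can be slow in practice, so "vectorized" here does not necessarily mean faster than a loop over slices.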

Solution

As I mentioned in the comments, a vectorized solution doesn't always guarantee a better running time. If your matrix is large, you might prefer a more efficient method. There are several reasons why the matrix rotation approach is slow (though intuitive); see the comments in the code.

Performance comparison:

Solution: Wall time: 61.6 ms
Rotation: Wall time: 3.32 s

Code (tested in a Jupyter notebook):

import numpy as np

def rotate45_and_sum(A):
    n = len(A) 
    x, y = np.meshgrid(np.arange(n), np.arange(n))  # at least doubled the running time
    xn, yn = x + y, n - x + y - 1   # generating xn and yn at least doubled the running time
    M = np.zeros((2*n -1, 2*n -1))  # at least slows down running time by a factor of 4
    M[xn,yn] = A[x,y] # very inefficient indexing strategy
    return M.sum(1)

def solution(A):
    n = A.shape[0]
    retval = np.zeros(2*n-1)
    for i in range(n):
        retval[i:(i+n)] += A[i, :]
    return retval

A = np.random.randn(10000, 10000)

%time solution(A)

%time rotate45_and_sum(A)
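The %time magics only work in IPython or Jupyter. Outside a notebook, a comparison along the same lines could look like the sketch below. The two functions are copied from the answer; the helper timed and the smaller 500 x 500 array are assumptions made here so the script runs quickly, and the timings will differ from the ones quoted above.

```python
import time
import numpy as np

def rotate45_and_sum(A):
    n = len(A)
    x, y = np.meshgrid(np.arange(n), np.arange(n))
    xn, yn = x + y, n - x + y - 1
    M = np.zeros((2 * n - 1, 2 * n - 1))
    M[xn, yn] = A[x, y]
    return M.sum(1)

def solution(A):
    n = A.shape[0]
    retval = np.zeros(2 * n - 1)
    for i in range(n):
        retval[i:(i + n)] += A[i, :]
    return retval

def timed(fn, A):
    """Hypothetical helper: return fn(A) and the elapsed wall time in seconds."""
    start = time.perf_counter()
    result = fn(A)
    return result, time.perf_counter() - start

A = np.random.randn(500, 500)  # assumed size, much smaller than in the answer
loop_result, loop_time = timed(solution, A)
rot_result, rot_time = timed(rotate45_and_sum, A)
print(f"loop: {loop_time:.4f} s, rotation: {rot_time:.4f} s")
print("same result:", np.allclose(loop_result, rot_result))
```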


In the multidimensional case:

def solution(A):
    h,w,x,y = A.shape                # change here
    retval = np.zeros((2*x-w,2*y-h)) # change here
    indices = np.ndindex(w, h)       # change here
    for index in indices:
        slices = tuple()
        for i in range(len(index)):
            slices = slices + (slice(index[i], index[i]+x),) # I assume x = y = ..., you need to change here also if the assumption is not correct
        retval[slices] += A[index] # slices is roughly equal `i:(i+x), j:(j+y)` in your code
    return retval

Actually, I don't know how the dimensions (or shapes) are calculated based on your description :(. But I think it can be generalized. The idea is to construct the slices as you go, so you need to specify which dimensions correspond to h, w and which correspond to x, y. I think that is not difficult to do.
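One way the slice construction could be generalized is sketched below. It rests on assumptions that the answer leaves open: the leading axes of A index the window positions, the trailing axes are the window itself, and the output has (number of windows + window size - 1) entries along each axis. The name accumulate_windows_nd and the window_shape parameter are made up for illustration.

```python
import numpy as np

def accumulate_windows_nd(A, window_shape):
    """Add each window of A into the matching slice of the output array."""
    window_shape = tuple(window_shape)
    nd = len(window_shape)
    grid_shape = A.shape[:nd]  # number of window positions along each axis
    if A.shape[nd:] != window_shape:
        raise ValueError("trailing axes of A must match window_shape")
    # Assumption: stride-1 windows, so each output axis has grid + window - 1 entries.
    out_shape = tuple(g + w - 1 for g, w in zip(grid_shape, window_shape))
    out = np.zeros(out_shape, dtype=A.dtype)
    for index in np.ndindex(*grid_shape):
        # The window at position `index` covers index[k] : index[k] + window_shape[k] on axis k.
        slices = tuple(slice(i, i + w) for i, w in zip(index, window_shape))
        out[slices] += A[index]
    return out

# 1d example from the question
A1 = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5]], dtype=np.float32)
print(accumulate_windows_nd(A1, (3,)))  # [1. 4. 9. 8. 5.]
```

The loop runs once per window position, and each iteration is a single slice addition, which keeps the memory access pattern contiguous.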

Reference: Numpy index array of unknown dimensions?


Regarding https://stackoverflow.com/a/67341994/14923227


def fast(A):
    n = A.shape[0]
    retval = np.zeros(2*n-1)
    for i in range(n):
        retval[i:(i+n)] += A[i, :]
    print(retval.sum())
    return retval

##########################
import threading

class sumThread(threading.Thread):
    def __init__(self, A, mat, threadID, ngroups, size):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.size = size
        self.ngroups = ngroups
        self.mat = mat
        self.A = A
    def run(self):
        # Each thread handles one contiguous chunk of rows and writes only to its own row of mat
        begin = (self.size + self.ngroups) // self.ngroups * self.threadID
        end   = min(self.size, (self.size+self.ngroups)//self.ngroups*(self.threadID+1))
        for i in range(begin, end):
            self.mat[self.threadID, i:(i+self.size)] += self.A[i, :]

def faster(A):
    
    num_threads = max(1, A.shape[0] // 4000) 
    mat = np.zeros((num_threads, 2*A.shape[0]-1))
    threads = []
    for i in range(num_threads):
        t = sumThread(A, mat, i, num_threads, A.shape[0])
        t.start()
        threads.append(t)

    # Wait for all threads to complete
    for t in threads:
        t.join()
    return np.sum(mat, axis=0)
    

Performance for large array:

A = np.random.randn(20000,20000)
%timeit fast(A)   # 263 ms ± 5.21 ms per loop 
%timeit faster(A) # 155 ms ± 3.14 ms per loop

It's trivial to parallelize the for loop in fast. But fast is actually the most cache efficient (even for GPU cache and memory banks) and thus the fastest way to compute it. Ideally, you can parallelize the code with CUDA/OpenCL since there are way more cores in a GPU. If you do it correctly, the running time will be reduced to log(original_fast_time) with base k, where k is the number of cores you have.
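The chunked loop can also be expressed with the standard library's thread pool instead of a Thread subclass. The sketch below is an alternative formulation, not the code that produced the timings above: the names partial_sum and accumulate_parallel and the default of 4 workers are assumptions, and any speedup depends on NumPy releasing the GIL during the slice additions, which was not measured here.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def partial_sum(A, begin, end):
    """Accumulate rows begin..end-1 of A into a private output buffer."""
    n_rows, n_cols = A.shape
    partial = np.zeros(n_rows + n_cols - 1)
    for i in range(begin, end):
        partial[i:i + n_cols] += A[i, :]
    return partial

def accumulate_parallel(A, num_workers=4):
    """Split the rows into contiguous chunks, sum each chunk in a thread, then reduce."""
    n_rows = A.shape[0]
    # Chunk boundaries; an empty chunk simply contributes zeros.
    bounds = np.linspace(0, n_rows, num_workers + 1, dtype=int)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(partial_sum, [A] * num_workers, bounds[:-1], bounds[1:]))
    return np.sum(partials, axis=0)

A = np.random.randn(200, 200)
result = accumulate_parallel(A)
print(result.shape)  # (399,)
```

Each worker writes to its own buffer, so no locking is needed; the price is one extra reduction over the partial results at the end.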

However, there are only a few computations in the function, so the transfer of data between main memory and GPU memory (GRAM) might dominate. (I didn't test it.)
