How to implement Pytorch 1D crosscorrelation for long signals in fourier domain?


Problem description


I have a series of signals of length n = 36,000 on which I need to perform crosscorrelation. Currently, my CPU implementation in numpy is a little slow. I've heard Pytorch can greatly speed up tensor operations, and provides a way to perform computations in parallel on the GPU. I'd like to explore this option, but I'm not quite sure how to accomplish it using the framework.

Because of the length of these signals, I'd prefer to perform the crosscorrelation operation in the frequency domain.

Normally using numpy I'd perform the operation like so:

import numpy as np
from scipy.fftpack import next_fast_len

signal_length = 36000

# make the signals
signal_1 = np.random.uniform(-1, 1, signal_length)
signal_2 = np.random.uniform(-1, 1, signal_length)

# output target length of crosscorrelation
x_cor_sig_length = signal_length * 2 - 1

# get optimized array length for fft computation
fast_length = next_fast_len(x_cor_sig_length)

# move data into the frequency domain. axis=-1 to perform
# along last dimension
fft_1 = np.fft.rfft(signal_1, fast_length, axis=-1)
fft_2 = np.fft.rfft(signal_2, fast_length, axis=-1)

# take the complex conjugate of one of the spectrums. Which one you
# choose depends on domain-specific conventions
fft_1 = np.conj(fft_1)

fft_multiplied = fft_1 * fft_2

# back to time domain.
prelim_correlation = np.fft.irfft(fft_multiplied, fast_length, axis=-1)

# shift the signal to make it look like a proper crosscorrelation,
# and transform the output to be purely real
final_result = np.real(np.fft.fftshift(prelim_correlation, axes=-1)).astype(np.float64)

Looking at the Pytorch documentation, there doesn't seem to be an equivalent for numpy.conj(). I'm also not sure if/how I need to implement a fftshift after the irfft operation.

So how would you go about writing a 1D crosscorrelation in Pytorch using the fourier method?

Solution

A few things to be considered.

The Python interpreter is very slow; what those vectorization libraries do is move the workload to a native implementation. In order to make any difference you need to be able to perform many operations in one python instruction. Evaluating things on the GPU follows the same principle: while the GPU has more compute power, it is slower to copy data to/from the GPU.
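The batching principle is easy to see in numpy itself (a minimal sketch with made-up sizes): one python-level call can transform a whole batch of signals at once, so the interpreter overhead is paid once rather than per signal.

```python
import numpy as np

# one numpy call transforms all 100 rows at once; the python-level
# overhead is paid once, not 100 times
batch = np.random.uniform(-1, 1, (100, 1024))
ffts = np.fft.rfft(batch, axis=-1)

# rfft of a length-1024 real signal has 1024 // 2 + 1 = 513 bins per row
assert ffts.shape == (100, 513)
```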

I adapted your example to process multiple signals simultaneously.

import numpy as np

def numpy_xcorr(BATCH=1, signal_length=36000, factors=[2, 3, 5], dtype=np.float64):
    # make the signals
    signal_1 = np.random.uniform(-1, 1, (BATCH, signal_length)).astype(dtype)
    signal_2 = np.random.uniform(-1, 1, (BATCH, signal_length)).astype(dtype)

    # output target length of crosscorrelation
    x_cor_sig_length = signal_length*2 - 1

    # get optimized array length for fft computation
    fast_length = next_fast_len(x_cor_sig_length, factors)

    # move data into the frequency domain. axis=-1 to perform
    # along last dimension (the 0.1*signal_1 term just makes the
    # cross correlation more interesting)
    fft_1 = np.fft.rfft(signal_1, fast_length, axis=-1)
    fft_2 = np.fft.rfft(signal_2 + 0.1 * signal_1, fast_length, axis=-1)

    # take the complex conjugate of one of the spectrums.
    fft_1 = np.conj(fft_1)

    fft_multiplied = fft_1 * fft_2

    # back to time domain.
    prelim_correlation = np.fft.irfft(fft_multiplied, fast_length, axis=-1)

    # shift the signal to make it look like a proper crosscorrelation,
    # and transform the output to be purely real
    final_result = np.fft.fftshift(np.real(prelim_correlation), axes=-1)
    return final_result, np.sum(final_result)

Since torch 1.7 we have the torch.fft module that provides an interface similar to numpy.fft; fftshift is missing, but the same result can be obtained with torch.roll. Another point to note is that numpy uses 64-bit precision by default, while torch will use 32-bit precision.
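The fftshift-via-roll equivalence is easy to check in numpy (a small sketch; torch.roll(t, n // 2, dims=-1) reproduces the same circular shift on a tensor):

```python
import numpy as np

# np.fft.fftshift along one axis is just a circular shift by n // 2,
# for both even and odd lengths
for n in (8, 9):
    x = np.arange(float(n))
    assert np.array_equal(np.fft.fftshift(x), np.roll(x, n // 2))
```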

Picking the fast length comes down to choosing smooth numbers (those that factor entirely into small primes; I suppose you are familiar with this subject).

def next_fast_len(n, factors=[2, 3, 5, 7]):
    '''
      Returns the minimum integer not smaller than n that can
      be written as a product (possibly with repetitions) of
      the given factors.
    '''
    best = float('inf')
    stack = [1]
    while len(stack):
        a = stack.pop()
        if a >= n:
            if a < best:
                best = a
                if best == n:
                    break  # no reason to keep searching
        else:
            for p in factors:
                b = a * p
                if b < best:
                    stack.append(b)
    return best
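A quick brute-force cross-check of the smooth-length idea (an independent sketch, not the function above): a length is "fast" exactly when it factors entirely into the allowed small primes.

```python
def is_smooth(n, factors=(2, 3, 5)):
    # repeatedly divide out the allowed prime factors
    for p in factors:
        while n % p == 0:
            n //= p
    return n == 1

def next_fast_len_bruteforce(n, factors=(2, 3, 5)):
    # walk upward until a smooth length is found
    while not is_smooth(n, factors):
        n += 1
    return n

# for two signals of length 36000 the target is 2*36000 - 1 = 71999,
# and the next 5-smooth length is 72000 = 2**6 * 3**2 * 5**3
assert next_fast_len_bruteforce(71999) == 72000
assert next_fast_len_bruteforce(17, (2, 3)) == 18
```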

Then the torch implementation goes like this:

import torch
import torch.fft

def torch_xcorr(BATCH=1, signal_length=36000, device='cpu', factors=[2, 3, 5], dtype=torch.float):
    # torch.rand is uniform in the range (0, 1); rescale to (-1, 1)
    signal_1 = 1 - 2*torch.rand((BATCH, signal_length), device=device, dtype=dtype)
    signal_2 = 1 - 2*torch.rand((BATCH, signal_length), device=device, dtype=dtype)

    # just make the cross correlation more interesting
    signal_2 += 0.1 * signal_1

    # output target length of crosscorrelation
    x_cor_sig_length = signal_length*2 - 1

    # get optimized array length for fft computation
    fast_length = next_fast_len(x_cor_sig_length, factors)

    # the last dimension will be transformed
    fft_1 = torch.fft.rfft(signal_1, fast_length, dim=-1)
    fft_2 = torch.fft.rfft(signal_2, fast_length, dim=-1)

    # take the complex conjugate of one of the spectrums. Which one you
    # choose depends on domain-specific conventions
    fft_multiplied = torch.conj(fft_1) * fft_2

    # back to time domain.
    prelim_correlation = torch.fft.irfft(fft_multiplied, fast_length, dim=-1)

    # shift the signal to make it look like a proper crosscorrelation,
    # and transform the output to be purely real
    final_result = torch.roll(prelim_correlation, fast_length // 2, dims=-1)
    return final_result, torch.sum(final_result)
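The final roll by fast_length // 2 centers zero lag in the middle of the output; the same shift logic can be checked with a small numpy autocorrelation (numpy used here only as a stand-in for the torch.roll call, with made-up lengths):

```python
import numpy as np

# an autocorrelation peaks at zero lag; after rolling by fast // 2,
# that peak must land exactly in the middle of the output
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100)
fast = 256
spec = np.conj(np.fft.rfft(x, fast)) * np.fft.rfft(x, fast)
xcorr = np.roll(np.fft.irfft(spec, fast), fast // 2)
assert np.argmax(xcorr) == fast // 2
```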

And here is some code to test the results:

import time

funcs = {'numpy-f64': lambda b: numpy_xcorr(b, factors=[2,3,5], dtype=np.float64),
         'numpy-f32': lambda b: numpy_xcorr(b, factors=[2,3,5], dtype=np.float32),
         'torch-cpu-f64': lambda b: torch_xcorr(b, device='cpu', factors=[2,3,5], dtype=torch.float64),
         'torch-cpu': lambda b: torch_xcorr(b, device='cpu', factors=[2,3,5], dtype=torch.float32),
         'torch-gpu-f64': lambda b: torch_xcorr(b, device='cuda', factors=[2,3,5], dtype=torch.float64),
         'torch-gpu': lambda b: torch_xcorr(b, device='cuda', factors=[2,3,5], dtype=torch.float32),
         }
times = {}
for batch in [1, 10, 100]:
    times[batch] = {}
    for l, f in funcs.items():
        t0 = time.time()
        t1, t2 = f(batch)
        tf = time.time()
        del t1
        del t2
        times[batch][l] = 1000 * (tf - t0) / batch

I obtained the following results

What surprised me is the result when the numbers are not so smooth: e.g. using a 17-smooth length, the torch implementation is so much better that I used a logarithmic scale here (with batch size 100 the torch GPU was 10000 times faster than numpy with batch size 1).

Remember that these functions are generating the data on the GPU; in general we want to copy the final results back to the CPU. If we account for the time spent copying the final result to the CPU, I observed times up to 10x higher than the cross-correlation computation itself (random data generation + three FFTs).
