How to apply CNN to Short-time Fourier Transform?

Problem description

So I have code that returns a Short-Time Fourier Transform spectrum of a .wav file. I want to be able to take, say, a millisecond of the spectrum and train a CNN on it.

I'm not quite sure how I would implement that. I know how to format image data to feed into the CNN, and how to train the network, but I'm lost on how to take the FFT data and divide it into small time frames.

The FFT code (sorry for the very long code):

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from skimage import util

rate, audio = wavfile.read('scale_a_lydian.wav')

audio = np.mean(audio, axis=1)  # average the stereo channels down to mono

N = audio.shape[0]
L = N / rate

M = 1024

# Audio is 44.1 Khz, or ~44100 samples / second
# window function takes 1024 samples or 0.02 seconds of audio (1024 / 44100 = ~0.02 seconds)
# and shifts the window 100 over each time
# so there would end up being (total_samplesize - 1024)/(100) total steps done (or slices)

slices = util.view_as_windows(audio, window_shape=(M,), step=100) #slices overlap

win = np.hanning(M + 1)[:-1]
slices = slices * win #each slice is 1024 samples (0.02 seconds of audio)

slices = slices.T #transpose matrix -> make each column 1024 samples (ie. make each column one slice)


# FFT each slice, then keep the upper half of the bins in reverse order
# (by conjugate symmetry of a real signal's FFT this matches the positive-frequency half)
spectrum = np.fft.fft(slices, axis=0)[:M // 2 + 1:-1]

spectrum = np.abs(spectrum) #take absolute value of slices

# take SampleSize * Slices
# transpose into slices * samplesize
# Take the first row -> slice * samplesize
# transpose back to samplesize * slice (essentially get 0.01s of spectrum)

spectrum2 = spectrum.T
spectrum2 = spectrum2[:1]
spectrum2 = spectrum2.T

The following outputs the FFT spectrum:

N = spectrum2.shape[0]
L = N / rate

f, ax = plt.subplots(figsize=(4.8, 2.4))

S = np.abs(spectrum2)
S = 20 * np.log10(S / np.max(S))

ax.imshow(S, origin='lower', cmap='viridis',
          extent=(0, L, 0, rate / 2 / 1000))
ax.axis('tight')
ax.set_ylabel('Frequency [kHz]')
ax.set_xlabel('Time [s]');
plt.show()

(Feel free to correct any theoretical errors that I put in the comments)

So, from what I understand, I have a NumPy array (spectrum) in which each column is a slice with 510 samples (cut in half, because half of each FFT slice is redundant), and each sample corresponds to one frequency bin?

The above code theoretically outputs 0.01 s of audio as a spectrum, which is exactly what I need. Is this true, or am I thinking about it wrong?
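
To make the timing concrete, here is a quick check using the variables from the code above (a minimal sketch; the printed shape depends on your .wav file):

# quick sanity check, reusing M, rate, and spectrum from the code above
window_seconds = M / rate   # 1024 / 44100 ≈ 0.023 s of audio covered by each slice
hop_seconds = 100 / rate    # ≈ 0.0023 s between the starts of consecutive slices

print(spectrum.shape)       # (frequency_bins, number_of_slices)
print(window_seconds, hop_seconds)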

Answer

I would suggest using Librosa to load the audio and do some pre-processing in just one line of code. You will want all your audio files to have the same sampling rate, and you will also want to cut the audio to a specific portion to get a specific interval. You can load the audio like this:

import librosa

y, sr = librosa.load(audiofile, offset=10.0, duration=30.0, sr=16000)

So you'll have your time series as y. From here, I would use a nice existing implementation of a CNN on audio, where the author uses his own library to compute mel-spectrograms on the GPU; you just need to feed your y into the network. Alternatively, you can remove the first layer of that network, pre-compute your mel-spectrograms, and save them somewhere; those would be your inputs to the network.
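
If you go the pre-computed route, a minimal sketch of that pipeline could look like the following. This is an illustrative assumption rather than the implementation from the linked post, and values such as hop_length=160, n_mels=128, and frames_per_chunk=10 are placeholders to tune:

import numpy as np
import librosa

# load and resample to a common rate (same idea as librosa.load above)
y, sr = librosa.load('scale_a_lydian.wav', sr=16000)

# pre-compute a mel-spectrogram and convert it to decibels
# n_fft, hop_length and n_mels are example values, not taken from the linked post
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=160, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)             # shape: (n_mels, n_frames)

# cut the spectrogram into fixed-width chunks along the time axis;
# 10 frames * 160 hop / 16000 Hz = 0.1 s of audio per chunk
frames_per_chunk = 10
n_chunks = S_db.shape[1] // frames_per_chunk
chunks = np.stack([S_db[:, i * frames_per_chunk:(i + 1) * frames_per_chunk]
                   for i in range(n_chunks)])          # (n_chunks, n_mels, frames_per_chunk)

# add a channel axis so each chunk looks like a one-channel "image" for a CNN
cnn_inputs = chunks[:, np.newaxis, :, :]               # (n_chunks, 1, n_mels, frames_per_chunk)

From there, cnn_inputs can be fed to whatever CNN you already know how to train on image-shaped data.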

Other resources: Audio Classification: A Convolutional Neural Network Approach
