How should audio be pre-processed for classification?


Problem description

I am currently developing an audio classifier with the Python API of TensorFlow, using the UrbanSound8K dataset and trying to distinguish between 10 mutually exclusive classes.

The audio files are 4 seconds long and contain 176400 data points each, which results in serious memory issues. How should the audio be pre-processed to reduce memory usage?

And how can more useful features be extracted from the audio (using convolution and pooling)?

Answer

I personally prefer spectrograms as input for neural nets when it comes to sound classification. This way, raw audio data is transformed into an image representation and you can treat it like a basic image classification task.
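To make "basic image classification task" concrete, here is a minimal convnet sketch with tf.keras (the asker mentions TensorFlow); the input shape, filter counts and layer sizes are illustrative assumptions, not part of the original answer:

import tensorflow as tf

#minimal sketch: spectrogram in, scores for the 10 UrbanSound8K classes out
#input shape (257, 399, 1) is an assumed spectrogram size (freq bins x frames x 1 channel)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(257, 399, 1)),
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])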

There are a number of ways to choose from; here is what I usually do using scipy, python_speech_features and pydub:

import numpy as np
import scipy.io.wavfile as wave
import python_speech_features as psf
from pydub import AudioSegment

#your sound file
filepath = 'my-sound.wav'

def convert(path):

    #open file (supports all ffmpeg supported filetypes) 
    audio = AudioSegment.from_file(path, path.split('.')[-1].lower())

    #set to mono
    audio = audio.set_channels(1)

    #set to 44.1 kHz
    audio = audio.set_frame_rate(44100)

    #save as wav (overwrites the input file in place)
    audio.export(path, format="wav")

def getSpectrogram(path, winlen=0.025, winstep=0.01, NFFT=512):

    #open wav file
    (rate,sig) = wave.read(path)

    #get frames
    winfunc=lambda x:np.ones((x,))
    frames = psf.sigproc.framesig(sig, winlen*rate, winstep*rate, winfunc)

    #Magnitude Spectrogram
    magspec = np.rot90(psf.sigproc.magspec(frames, NFFT))

    #noise reduction (mean subtraction)
    magspec -= magspec.mean(axis=0)

    #normalize values between 0 and 1
    magspec -= magspec.min(axis=0)
    magspec /= magspec.max(axis=0)

    #show spec dimensions
    print(magspec.shape)

    return magspec

#convert file if you need to
convert(filepath)

#get spectrogram
spec = getSpectrogram(filepath)

First, you need to standardize your audio files in terms of sample rate and channels. You can do that (and more) with the excellent pydub package.
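Since the question is partly about memory, note that the same pydub step can also downsample the audio before any spectrogram is computed; a sketch, where the 16 kHz target rate is an assumption (it roughly quarters the 176400 samples per clip, at the cost of losing frequency content above 8 kHz):

from pydub import AudioSegment

#assumed example: resample to 16 kHz mono, so 4 s is 64000 samples instead of 176400
audio = AudioSegment.from_file('my-sound.wav', 'wav')
audio = audio.set_channels(1)
audio = audio.set_frame_rate(16000)  #assumed target rate, not from the original answer
audio.export('my-sound-16k.wav', format='wav')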

After that, you need to transform your audio signal into an image with an FFT. You can do that with scipy.io.wavfile and the sigproc module of python_speech_features. I like the magnitude spectrogram: rotate it 90 degrees, normalize it, and use the resulting NumPy array as input for my convnets. You can change the spatial dimensions of the spectrogram by adjusting the values of winstep and NFFT to fit your input size.
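If it helps, the output size can be worked out before running anything; a short sketch with the parameters used above (4 s of mono audio at 44.1 kHz, i.e. the 176400 samples from the question; framesig's exact rounding may differ by a frame):

#expected spectrogram dimensions for winlen=0.025, winstep=0.01, NFFT=512
rate = 44100
nsamples = 4 * rate                 #176400 data points, as in the question
winlen, winstep, NFFT = 0.025, 0.01, 512

nframes = 1 + int((nsamples - winlen * rate) / (winstep * rate))  #~398 frames
nbins = NFFT // 2 + 1                                             #257 frequency bins
print((nbins, nframes))  #shape after the 90 degree rotation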

There might be easier ways to do all that; I achieved good overall classification results using the code above.
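One candidate for such an easier way (not used in the original answer) is librosa, which folds loading, mono conversion, resampling and the spectrogram into a few calls; a sketch assuming a log-scaled mel spectrogram is an acceptable substitute for the magnitude spectrogram above:

import numpy as np
import librosa

#load as mono, resampled to 44.1 kHz in one step
sig, rate = librosa.load('my-sound.wav', sr=44100, mono=True)

#mel spectrogram with comparable window/hop sizes, then log scaling
melspec = librosa.feature.melspectrogram(y=sig, sr=rate, n_fft=512, hop_length=441)
logmel = librosa.power_to_db(melspec, ref=np.max)

print(logmel.shape)  #(n_mels, frames), usable as convnet input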
