Convolutional Neural Network (CNN) for Audio


Question

I have been following the tutorials on DeepLearning.net to learn how to implement a convolutional neural network that extracts features from images. The tutorials are well explained and easy to understand and follow.

I want to extend the same CNN to extract multi-modal features from videos (images + audio) at the same time.

I understand that video input is nothing but a sequence of images (pixel intensities) displayed over a period of time (e.g. 30 FPS) associated with audio. However, I don't really understand what audio is, how it works, or how it is broken down to be fed into the network.

I have read a couple of papers on the subject (multi-modal feature extraction/representation), but none have explained how audio is input to the network.

Moreover, I understand from my studies that multi-modal representation is the way our brains really work, as we don't deliberately filter out our senses to achieve understanding. It all happens simultaneously, without our awareness, through a joint representation. A simple example: if we hear a lion roar, we instantly compose a mental image of a lion and feel danger, and vice versa. Multiple neural patterns fire in our brains to achieve a comprehensive understanding of what a lion looks like, sounds like, feels like, smells like, etc.

The above is my ultimate goal, but for the time being I'm breaking my problem down for the sake of simplicity.

I would really appreciate it if anyone could shed light on how audio is dissected and later represented in a convolutional neural network. I would also appreciate your thoughts on multi-modal synchronisation, joint representations, and the proper way to train a CNN with multi-modal data.

I have found that audio can be represented as spectrograms. A spectrogram is a common format for audio: a graph with two geometric dimensions, where the horizontal axis represents time and the vertical axis represents frequency.

Is it possible to use the same technique used for images on these spectrograms? In other words, can I simply use these spectrograms as input images for my convolutional neural network?
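To make the spectrogram idea concrete, here is a minimal numpy sketch (not part of any particular library) that turns a 1-D audio signal into a 2-D magnitude spectrogram via a short-time Fourier transform. The frame size, hop length, and the 440 Hz test tone are illustrative choices, not values from the question:

```python
import numpy as np

def spectrogram(signal, frame_size=256, hop=128):
    """Magnitude spectrogram via a short-time Fourier transform.

    Returns a 2-D array: rows are frequency bins, columns are time
    frames, so the result can be treated like a grayscale image.
    """
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_size] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies: frame_size // 2 + 1 bins
    return np.abs(np.fft.rfft(frames, axis=1)).T

# 1 second of a 440 Hz tone sampled at 8 kHz (hypothetical test signal)
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)

spec = spectrogram(tone)
print(spec.shape)  # (129, 61): frequency bins x time frames
```

The resulting 2-D array of intensities is exactly the kind of input a CNN designed for images expects; the energy concentrates around the bin nearest 440 Hz, just as an image would have bright pixels where the signal is strong.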

Answer

We used deep convolutional networks on spectrograms for a spoken language identification task. We had around 95% accuracy on a dataset provided in this TopCoder contest. The details are here.

Plain convolutional networks do not capture temporal characteristics, so, for example, in this work the output of the convolutional network was fed into a time-delay neural network. But our experiments show that even without additional elements, convolutional networks can perform well on at least some tasks when the inputs are of similar sizes.
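The two-stage idea in the answer (a CNN over the spectrogram, then a time-delay layer across frames) can be sketched in plain numpy. This is only an illustrative toy with random weights, not the contest network; the shapes and kernel sizes are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(image, kernel):
    """Naive 'valid' 2-D cross-correlation (one channel, one filter)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Stand-in spectrogram: 129 frequency bins x 61 time frames
spec = rng.random((129, 61))

# Stage 1: a convolutional layer treats the spectrogram like an image
# (5x5 filter, ReLU activation).
feat = np.maximum(conv2d_valid(spec, rng.standard_normal((5, 5))), 0)

# Stage 2: collapse frequency, then a time-delay layer -- a 1-D
# convolution across the time axis -- models the temporal context
# that a plain CNN alone does not capture.
frame_vec = feat.mean(axis=0)           # one value per time frame
delay_kernel = rng.standard_normal(3)   # context of 3 consecutive frames
temporal = np.convolve(frame_vec, delay_kernel, mode="valid")

print(feat.shape, temporal.shape)  # (125, 57) (55,)
```

A time-delay network is, in essence, a 1-D convolution over the frame sequence, which is why chaining it after a 2-D CNN recovers temporal structure.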

