Convolutional Neural Network (CNN) for Audio


Problem Description

I have been following the tutorials on DeepLearning.net to learn how to implement a convolutional neural network that extracts features from images. The tutorials are well explained and easy to understand and follow.

I want to extend the same CNN to extract multi-modal features from videos (images + audio) at the same time.

I understand that video input is nothing but a sequence of images (pixel intensities) displayed over a period of time (e.g. 30 FPS) associated with audio. However, I don't really understand what audio is, how it works, or how it is broken down to be fed into the network.
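To make "what audio is" concrete: digital audio is just a 1-D array of amplitude samples taken at a fixed sample rate. A minimal sketch (the 16 kHz rate and 440 Hz tone are arbitrary choices for illustration):

```python
import numpy as np

# Digital audio is a 1-D array of amplitude samples at a fixed sample rate.
# Here we synthesize one second of a 440 Hz sine wave sampled at 16 kHz.
sample_rate = 16000                      # samples per second
duration = 1.0                           # seconds
t = np.arange(int(sample_rate * duration)) / sample_rate
waveform = np.sin(2 * np.pi * 440.0 * t)

print(waveform.shape)                    # (16000,) -> one value per sample
print(waveform.min(), waveform.max())    # amplitudes lie in [-1, 1]
```

A real recording loaded from a WAV file has exactly this shape (one array per channel); the network never sees "sound", only these numbers.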

I have read a couple of papers on the subject (multi-modal feature extraction/representation), but none have explained how audio is input to the network.

Moreover, I understand from my studies that multi-modal representation is the way our brains really work, as we don't deliberately filter out our senses to achieve understanding. It all happens simultaneously, without our awareness, through joint representations. A simple example: if we hear a lion roar, we instantly compose a mental image of a lion and feel danger, and vice versa. Multiple neural patterns are fired in our brains to achieve a comprehensive understanding of what a lion looks like, sounds like, feels like, smells like, and so on.

The above is my ultimate goal, but for the time being I'm breaking my problem down for the sake of simplicity.

I would really appreciate it if anyone could shed light on how audio is dissected and then later represented in a convolutional neural network. I would also appreciate your thoughts on multi-modal synchronisation, joint representations, and the proper way to train a CNN with multi-modal data.

I have found that audio can be represented as spectrograms. This is a common format for audio: a graph with two geometric dimensions, where the horizontal axis represents time and the vertical axis represents frequency.
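A spectrogram is computed by slicing the waveform into short overlapping frames and taking an FFT of each one (the short-time Fourier transform). A minimal NumPy sketch, assuming a mono waveform; the frame length, hop size, and Hann window are typical but arbitrary choices:

```python
import numpy as np

def spectrogram(waveform, frame_len=512, hop=128):
    """Log-magnitude STFT spectrogram, shape (freq_bins, time_frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    # Slice the waveform into overlapping windowed frames.
    frames = np.stack([
        waveform[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft yields frame_len // 2 + 1 frequency bins per frame.
    mag = np.abs(np.fft.rfft(frames, axis=1))
    # Log compression roughly matches how loudness is perceived.
    return np.log1p(mag).T

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 440.0 * t)   # 1 s of a 440 Hz tone
spec = spectrogram(audio)
print(spec.shape)                       # (257, 122): a 2-D "image"
```

The pure tone shows up as a single bright horizontal band near the 440 Hz bin, which is exactly the kind of 2-D structure a CNN can pick up.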

Is it possible to use the same technique applied to images on these spectrograms? In other words, can I simply use these spectrograms as input images for my convolutional neural network?

Recommended Answer

We used deep convolutional networks on spectrograms for a spoken language identification task. We had around 95% accuracy on a dataset provided in this TopCoder contest. The details are here.

Plain convolutional networks do not capture the temporal characteristics, so, for example, in this work the output of the convolutional network was fed to a time-delay neural network. But our experiments show that even without additional elements, convolutional networks can perform well on at least some tasks when the inputs have similar sizes.
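To connect this back to the question: a convolutional layer treats the spectrogram exactly like a grayscale image, sliding a small 2-D filter over the time-frequency plane. A minimal NumPy illustration (not the network from the answer above, just the core operation):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D cross-correlation of a single-channel image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product of the filter with one image patch.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
spec = rng.random((257, 122))            # stand-in spectrogram (freq x time)
kernel = rng.standard_normal((5, 5))     # one 5x5 filter (learned in practice)
feature_map = np.maximum(conv2d_valid(spec, kernel), 0.0)  # ReLU activation
print(feature_map.shape)                 # (253, 118)
```

In a real network a framework's optimized convolution would be used, but the input is the same: a 2-D array, no different from an image of pixel intensities.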
